At least some embodiments disclosed herein relate to scheduling instructions of a computer program for execution in a computing device in general and more particularly, but not limited to, scheduling the instructions for parallel execution in multiple circuit tiles of the computing device.
Traditionally, assembly language programming is based on specifying operations to be performed on data stored in registers. A typical opcode is specified to identify an operation to be performed on data stored in one or more registers identified for the opcode; and the result of the operation is to be stored in a register identified for the opcode.
To execute such a traditional assembly language program, virtual registers referenced in the program are mapped to physical registers in a processor for execution of the program. When there are fewer physical registers than the virtual registers referenced in the program, values are shifted around among the physical registers to implement register reuse and satisfy the virtual register usages in the program.
An artificial neural network (ANN) uses a network of neurons to process inputs to the network and to generate outputs from the network.
Reinforcement learning (RL) is a machine learning technique designed to train a computer agent to determine desirable actions through trial and error. For example, the agent can be implemented as a model of policies to select, based on inputs, an action from candidates according to an artificial neural network (ANN). The action responsive to the inputs can generate a reward; and the reward can be used to train the agent to maximize accumulative rewards.
The embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements.
At least some embodiments disclosed herein provide techniques of configuring a coarse grained reconfigurable array to run an assembly language program specifying data flows through memory locations represented by memory variables.
Compute near memory (CNM) architecture can be used to leverage the dramatic opportunities provided by high performance communication protocols, such as the compute express link (CXL) protocol. Such compute near memory (CNM) architecture can incorporate heterogeneous compute elements in a memory/storage subsystem to accelerate various computing tasks near data. An example of such compute elements is a streaming engine (SE) implemented via a coarse grained reconfigurable array (CGRA) having interconnected computing tiles. The tiles are interconnected with both a synchronous fabric (SF) and an asynchronous fabric (AF). The synchronous fabric (SF) can be configured to connect each tile with neighboring tiles that are one or two clock cycles away. The synchronous fabric (SF) interconnects elements within each tile, such as tile memory, multiplexers, and single instruction multiple data (SIMD) units, etc. Tiles can be pipelined through synchronous fabric (SF) to form a synchronous data flow (SDF) through the single instruction multiple data (SIMD) units for operations such as multiply/shift, add/logical operations, etc. Each tile can have a pipelined time-multiplexed processing unit such that a new instruction can start on each tile at every clock cycle. The asynchronous fabric (AF) connects a tile with other tiles, a dispatch interface (DI), and memory interfaces (MIs). The asynchronous fabric (AF) bridges synchronous data flows (SDF) through asynchronous operations, which include initiation of synchronous data flow, asynchronous data transfer from one synchronous data flow to another, system memory accesses, and branching and looping constructs. Each tile can have a delay register to hold its output for outputting with timing alignment with execution of an instruction that uses the output. Together, the synchronous fabric (SF) and asynchronous fabric (AF) allow the tiles to efficiently execute high-level programming language constructs.
Simulation results of hand-crafted streaming engine (SE) kernels have shown orders-of-magnitude better performance per watt on data-intensive applications than existing computing platforms.
However, it is challenging to apply traditional compilation tools to program operations of a new architecture, such as streaming engine (SE) implemented using a coarse grained reconfigurable array (CGRA). In a dataflow based coarse grained reconfigurable array (CGRA), a program works by flowing data from one tile to another in a synchronous fashion. This requires instructions to be programmed at an exact cycle on the correct tile to avoid corrupting the synchronous flow of operations. Instead of morphing a dataflow to pretend it is a sequence of register transfers as in traditional assembly, at least some embodiments discussed in the present disclosure use a new assembly language with a corresponding parser that enables describing a program as a group of graphs that represent the data flows.
Configuring a streaming engine (SE) requires finding a synchronous schedule of instructions such that a flow can start for a data element and have every subsequent instruction line up on a valid tile on the correct cycle. The assembly language of at least some embodiments discussed in the present disclosure is advantageous in the determination of such a synchronous schedule. It can be used to describe some of the configuration details of the hardware as well as the data flow of the computation.
In one embodiment, the assembly language is configured to describe the details of a program for a streaming engine (SE). For example, a dispatch interface (DI) block of the program can be configured to specify information about the dispatch interface of the streaming engine (SE); a memory interface (MI) block can be configured to specify information about memory operations implemented via memory interfaces of the streaming engine (SE); a tile memory (TM) block can be configured to specify information about memory variables to be mapped to tile memories of the streaming engine (SE); and a flows block can be configured to specify a group of graphs representative of the synchronous data flows.
Optionally, a user describes the computation to be performed by a streaming engine (SE) in terms of configuration details specified using the dispatch interface (DI) block, the memory interface (MI) block, and tile memory (TM) block, and the program details via the flows block. Such an assembly language program can be parsed, mapped, and lowered by a software tool into a program execution configuration of the streaming engine in running the assembly language program.
Optionally, a compiler can be used to automate the conversion of a computer program written in a high-level programming language to the assembly language program according to the present disclosure.
The disclosed techniques of assembly language programs have various advantages. For example, representing configuration and data flow allows the assembly to reflect the device state. For example, programming data flows allows a programmer to work in terms of how data is moving between operations instead of how to schedule the hardware details between tiles. For example, breaking code into separate synchronous flows allows the programmer to explicitly define the asynchronous messaging that happens between synchronous elements. For example, programming the device at the abstract representation of assembly language is much faster than working at the low-level details of specifying operations of the multiplexers and tile connections. For example, a parser can provide friendlier error messaging for typos and inconsistent logic instead of debugging why the device simulation didn't terminate or provided incorrect answers. For example, defining an assembly language opens future possibilities of leveraging mainstream compiler tools to compile high level code down to such a more abstract description of the device. For example, since programs are lists of instructions, high and low-level knobs can be provided to the programmer through instruction representation. For example, a low-level type of instruction allows the programmer to specify individual fields/opcodes that end up in the instruction; or, a high-level format in terms of operations instead of fields can be used.
An assembly language program describing data flows can be mapped for execution on a specific coarse grained reconfigurable array (CGRA). The coarse grained reconfigurable array (CGRA) can have a particular structure, e.g., a number of tiles and memory interfaces, and particular inter-connectivity of synchronous fabric (SF) and/or asynchronous fabric (AF) among the tiles. Such a particular structure can be specific to the coarse grained reconfigurable array (CGRA) that is to be used in execution of the program and thus not reflected in the assembly language program. Because the assembly language program is shielded from such details, it can be mapped for execution on different coarse grained reconfigurable arrays (CGRAs) having different structural details.
A scheduler can map the instructions of the assembly language program for execution in tiles of a coarse grained reconfigurable array (CGRA). Since each tile can have a pipelined time-multiplexed processing unit, a new instruction can start on each tile at every clock cycle. Thus, the scheduler can generate a schedule specifying which instruction is programmed on which tile for execution at which clock cycle. The scheduler can determine the tiles and clock cycles of the instructions being mapped in a correct combination such that the data flows in the coarse grained reconfigurable array (CGRA) propagate with proper timing. For example, outputs of tiles are produced at proper clock cycles to be provided in time, through the synchronous fabric (SF), and/or the asynchronous fabric (AF), as inputs for further processing in the tiles. As the instructions are mapped to the tiles, the memory variables used by the instructions are also mapped to memories in the tiles.
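As a minimal illustrative sketch of the timing requirement described above, a schedule can be represented as a mapping from each instruction to a tile and a start clock cycle, and checked edge by edge against the data flow. The names `timing_violations`, `linear_hops`, `op_latency`, and `hop_latency` are hypothetical and introduced only for illustration; the sketch further assumes that an output must arrive exactly on the start cycle of its consumer (i.e., no delay register is used) and that tiles are arranged in a line with one cycle per hop, which need not match any particular coarse grained reconfigurable array (CGRA).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: instruction name -> (tile index, start cycle).
Schedule = Dict[str, Tuple[int, int]]


def timing_violations(
    schedule: Schedule,
    edges: List[Tuple[str, str]],
    op_latency: Dict[str, int],
    hop_latency: Callable[[int, int], int],
) -> List[Tuple[str, str]]:
    """Return the (producer, consumer) edges whose timing does not line up.

    Assumption: the consumer must start on the exact cycle at which the
    producer's output arrives (no delay register is modeled here).
    """
    bad = []
    for producer, consumer in edges:
        p_tile, p_cycle = schedule[producer]
        c_tile, c_cycle = schedule[consumer]
        arrival = p_cycle + op_latency[producer] + hop_latency(p_tile, c_tile)
        if c_cycle != arrival:
            bad.append((producer, consumer))
    return bad


def linear_hops(src_tile: int, dst_tile: int) -> int:
    """Assumed fabric model: tiles in a line, one cycle per hop."""
    return abs(src_tile - dst_tile)


example_schedule = {"load": (0, 0), "mul": (1, 2), "add": (1, 3)}
example_edges = [("load", "mul"), ("mul", "add")]
example_latency = {"load": 1, "mul": 1, "add": 1}
violations = timing_violations(
    example_schedule, example_edges, example_latency, linear_hops
)
```

In this example every edge lines up, so `violations` is empty; moving `add` to a later cycle without a compensating delay would cause the edge from `mul` to `add` to be reported.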
Based on the schedule of the instructions for execution in the tiles, a software tool (e.g., a lowering program) can be configured to generate an execution configuration of the coarse grained reconfigurable array for running the assembly language program. The software tool can determine the details on how to configure each connection between tiles. The software tool can determine the low-level details of dividing the tile memory into regions to implement the memory variables mapped to the tiles. The software tool can determine the settings of the correct multiplexer bits in the tiles to ensure data flows correctly at the correct clock cycle within the tiles. Even a single missing or incorrect bit can break or corrupt the entire program. The details determined by the software tool can be specified in the execution configuration to control the execution of the assembly language program in the coarse grained reconfigurable array (CGRA).
For example, according to the assembly language program and the schedule, the software tool can walk the dataflow graph to trace which operations will be the master control of the successor operations. As it traces the graph, it can set the outgoing control for the current tile operation; and as it traverses to a child, it can set the incoming control information on the cycle it arrives. Control settings can also be determined and set for data passing through routes used on the tiles as well as delay registers.
The software tool can also use the dispatch interface information and the memory interface information provided in the assembly language program to configure operations of the dispatch interface and memory interfaces of the coarse grained reconfigurable array (CGRA). The assembly language program specifies high-level details about the messaging generated by the dispatch interface and memory interfaces. Using the schedule the software tool can identify the messaging in terms of physical hardware locations in the coarse grained reconfigurable array (CGRA).
The software tool has various advantages. Manually generating the execution configuration of a coarse grained reconfigurable array (CGRA) is a monotonous, laborious, error-prone process that can take dozens of man-hours for even simple problems. The software tool automates the work to allow easy verification of hardware constraints. If hardware timing details change, the software tool can rerun with changes in parameters to generate a new execution configuration. The design of such a software tool configured to receive the schedule generated by a scheduler as an input allows many hardware details to be offloaded from the scheduler, such that implementation of different mapping strategies for the scheduler can focus on instruction placement.
To schedule instructions for execution on tiles of a coarse grained reconfigurable array (CGRA), it is possible to use a brute force approach to explore all of the possible choices of instruction placement and scheduling, and then select a best performing schedule. However, there are a huge number of possible choices in the search space, resulting from a combinatorial explosion of choices.
For example, a delay register of a tile can be used to implement a selected number of delays to synchronize output timing and input timing. As a result, the delay register introduces a number of possible choices that can be multiplied by other choices to increase the possible choices. For example, a tile can have multiple tile memories available to implement a memory variable; and the memory variable can be implemented in one of the tiles of the coarse grained reconfigurable array (CGRA). Thus, there are many possible choices for the implementation of one memory variable; and the possible implementations of a number of memory variables can increase dramatically as the number of memory variables increases. The combination of implementing which variables on which portion of which tile memory of which tiles and scheduling which instruction for which execution on which tile at which clock cycle can lead to a huge search space.
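The multiplication of choices described above can be illustrated with a rough count. The following sketch is not a model of any particular device; the function name `search_space_size` and all of its parameters are hypothetical, and the count is a loose upper bound that ignores every validity constraint. It assumes each memory variable independently picks a tile and a memory region, and each instruction independently picks a tile, a clock cycle, and a delay setting.

```python
def search_space_size(
    num_instructions: int,
    num_variables: int,
    num_tiles: int,
    regions_per_tile: int,
    num_cycles: int,
    delay_choices: int,
) -> int:
    """Loose upper bound on the number of candidate schedules.

    Assumptions (for illustration only): choices are independent and
    no constraint prunes any combination.
    """
    variable_choices = (num_tiles * regions_per_tile) ** num_variables
    instruction_choices = (num_tiles * num_cycles * delay_choices) ** num_instructions
    return variable_choices * instruction_choices


# A small program on a small array already yields billions of candidates.
small_case = search_space_size(
    num_instructions=4,
    num_variables=2,
    num_tiles=4,
    regions_per_tile=2,
    num_cycles=8,
    delay_choices=3,
)
```

Under these assumptions, four instructions and two memory variables on four tiles give (4 x 2)^2 x (4 x 8 x 3)^4 = 5,435,817,984 combinations, which shows why pruning choices before searching is attractive.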
In one embodiment, the search space to be explored by a brute force approach is reduced by performing selections prior to searches. The selections reduce the search space and improve the efficiency in obtaining a valid schedule.
For example, a scheduler can be configured to determine, before starting the brute force search, an allocation of memory variables to tile memories subject to some constraints. The allocation represents a distribution of memory variables of the program to tiles for implementation using tile memories of the respective tiles. For example, such an allocation can be performed with an aim to balance the number of instructions/variables per tile. Further, certain hardware details can be considered in the determination of the allocation (e.g., placing neighbors in data flow on tiles close to each other). Determining the allocation before scheduling instructions can reduce the combinatorial number of choices to be explored by the brute force search.
For example, an instruction can be placed in the tile in which memory variables used by the instruction are implemented. Thus, determining a memory allocation prior to a search can reduce the choices in scheduling an instruction; and the instruction can be scheduled without having to explore possible choices associated with other tiles.
The scheduling of instructions according to one allocation of memory variables to tile memories can be performed independently of the scheduling of instructions according to another allocation. Thus, parallel searches can be performed based on different memory allocations. A resulting schedule having the best score (e.g., based on performance in latency, and/or power, etc.) can be used.
In one embodiment, a valid schedule is to satisfy certain constraints. For example, instructions that share one or more tile memory variables should be placed on a same tile; no two instructions that each start a synchronous flow may be placed on a same tile; and/or no two sibling instructions may be placed on a same tile, etc. The constraints can be considered by the scheduler in making selections that reduce choices to be explored using a brute force approach.
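The three example constraints can be checked pairwise over a candidate assignment of instructions to tiles, as in the following minimal sketch. The function `placement_violations`, its argument names, and the rule labels are hypothetical; in particular, the sketch assumes that "sibling" instructions are instructions sharing at least one parent in the data flow graph, which is an interpretation rather than a definition taken from the disclosure.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def placement_violations(
    tile_of: Dict[str, int],
    mem_vars: Dict[str, Set[str]],
    flow_starts: Set[str],
    parents: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Return (rule, instruction_a, instruction_b) for each violated rule."""
    problems = []
    for a, b in combinations(sorted(tile_of), 2):
        same_tile = tile_of[a] == tile_of[b]
        shares_memory = bool(mem_vars.get(a, set()) & mem_vars.get(b, set()))
        # Assumed meaning of "sibling": the two instructions share a parent.
        siblings = bool(parents.get(a, set()) & parents.get(b, set()))
        if shares_memory and not same_tile:
            problems.append(("shared-memory-split", a, b))
        if same_tile and a in flow_starts and b in flow_starts:
            problems.append(("two-flow-starts", a, b))
        if same_tile and siblings:
            problems.append(("siblings-together", a, b))
    return problems


good_tiles = {"ld_a": 0, "ld_b": 1, "mul": 2, "acc": 2}
uses = {"mul": {"buf"}, "acc": {"buf"}}
starts = {"ld_a", "ld_b"}
graph_parents = {"mul": {"ld_a", "ld_b"}, "acc": {"mul"}}
no_problems = placement_violations(good_tiles, uses, starts, graph_parents)
```

A checker of this kind can be used both to reject candidate groupings early and to confirm that a completed partition respects the constraints.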
For example, a scheduler can partition the instructions of a program into a target number of instruction groups. The target number can be equal to or greater than the number of synchronous data flows specified in the program. Each of the instruction groups is selected to be scheduled on a tile; and the instruction groups are selected to meet the constraints to be satisfied by a valid schedule.
Further, the partitioning of the instructions of the program into instruction groups can be performed to satisfy additional requirements. For example, the partitioning of the instructions of the program can be performed to balance the instruction groups to have a similar number of instructions per group, and/or balance memory usages of the instruction groups to have a similar total tile memory utilization per group, etc. Further, within each instruction group, tile memory variables are distributed to tile memory region(s) without exceeding tile memory region size, without creating a tile memory access conflict for any instruction, etc.
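One possible way to obtain balanced groups is sketched below, purely as an illustration and not as the partitioning method of any embodiment. Instructions that share a tile memory variable are first merged into clusters (using a union-find structure), and the clusters are then assigned greedily to the currently smallest group. The function name `partition_instructions` is hypothetical, and the sketch deliberately ignores the flow-start and sibling constraints as well as tile memory region sizes.

```python
from typing import Dict, List, Set


def partition_instructions(
    mem_vars: Dict[str, Set[str]], num_groups: int
) -> List[List[str]]:
    """Greedy, memory-aware partition of instructions into num_groups groups."""
    parent = {ins: ins for ins in mem_vars}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge instructions that touch the same tile memory variable.
    first_user: Dict[str, str] = {}
    for ins, variables in mem_vars.items():
        for var in sorted(variables):
            if var in first_user:
                parent[find(ins)] = find(first_user[var])
            else:
                first_user[var] = ins

    clusters: Dict[str, List[str]] = {}
    for ins in mem_vars:
        clusters.setdefault(find(ins), []).append(ins)

    # Largest clusters first, each into the group with the fewest instructions.
    groups: List[List[str]] = [[] for _ in range(num_groups)]
    for cluster in sorted(clusters.values(), key=len, reverse=True):
        min(groups, key=len).extend(cluster)
    return groups


program = {
    "i0": {"a"}, "i1": {"a"}, "i2": {"b"},
    "i3": set(), "i4": {"b"}, "i5": set(),
}
groups = partition_instructions(program, num_groups=3)
```

Because different tie-breaking orders or cluster orderings produce different partitions, a routine of this kind can be run with several variations to generate alternative sets of instruction groups.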
After the partitioning of the instructions of the program into groups, the scheduler can schedule the instructions of each group and their corresponding tile memories on a tile of the coarse grained reconfigurable array (CGRA) to generate a schedule.
In general, a program can be partitioned in different ways into different sets of instruction groups. Scheduling different sets of instruction groups can be performed in parallel to generate different schedules. The performance scores of the resulting schedules can be evaluated (e.g., based on latency, and/or energy consumption, etc.) to select a best performing schedule as the output.
In some embodiments, a tile of a coarse grained reconfigurable array (CGRA) can have multiple instruction slots for pipelined execution. To schedule an instruction in a tile, the scheduler determines a slot of the tile to schedule the instruction for execution. The schedule of the instruction is selected to have valid timing and slot configurations for the instruction, the prior instructions that have been scheduled before the instruction, and the subsequent instructions that are scheduled after the instruction.
For example, the scheduler can be configured to perform the operation of scheduling one instruction recursively. For a current instruction selected for scheduling, the scheduler can search for parameters (e.g., slot and/or clock cycle) of the schedule of the current instruction in order to produce a valid schedule for the combination of the current instruction and the prior instructions that have been scheduled before the current instruction. If a scheduler finds a schedule that is valid in timing and other constraints for the current instruction and the prior instructions, if any, that have been scheduled before the current instruction, the scheduler proceeds to select a next instruction from the remaining instructions to be scheduled, and then processes the next instruction as a current instruction selected for scheduling, until there is no remaining instruction to be scheduled.
However, if the scheduler determines that there is no valid schedule for the current instruction in view of the prior instructions having been scheduled before the current instruction, the particular schedule of the prior instructions that have been scheduled before the current instruction is invalid. The scheduler can then move back to the previous instruction scheduled before the current instruction and process the previous instruction as a current instruction to be rescheduled. The process can continue until a valid schedule is found for the instruction groups, or it is determined that no valid schedule can be found for the instruction groups.
Using the techniques, the scheduler can produce, within a sensible amount of time, at least one valid schedule for running a program of data flows in a coarse grained reconfigurable array (CGRA). The core function of the scheduler is highly efficient in terms of time complexity because, upon a failure in scheduling, it immediately terminates the call, recovers the previously achieved schedule for a subset of instructions of the program, and continues from that schedule the search for a new valid schedule for more instructions.
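The recursive scheduling and rescheduling described above resembles a backtracking search, which can be sketched as follows. The names `backtrack_schedule`, `all_choices`, and `make_validator` are hypothetical. The validity rule used in the example is a toy assumption (each slot holds one instruction, and a consumer starts exactly one cycle after its producer) standing in for the actual timing and slot constraints of a tile.

```python
from typing import Callable, Dict, List, Optional, Tuple

Choice = Tuple[int, int]  # assumed encoding: (slot, clock cycle)


def backtrack_schedule(
    instructions: List[str],
    candidates: Callable[[str], List[Choice]],
    is_valid: Callable[[Dict[str, Choice], str, Choice], bool],
    partial: Optional[Dict[str, Choice]] = None,
) -> Optional[Dict[str, Choice]]:
    """Schedule one instruction per call; undo the choice when a later one fails."""
    partial = {} if partial is None else partial
    if len(partial) == len(instructions):
        return dict(partial)
    current = instructions[len(partial)]
    for choice in candidates(current):
        if is_valid(partial, current, choice):
            partial[current] = choice
            result = backtrack_schedule(instructions, candidates, is_valid, partial)
            if result is not None:
                return result
            del partial[current]  # recover the previously achieved schedule
    return None  # no valid choice: the caller must reschedule its instruction


def all_choices(num_slots: int, num_cycles: int) -> Callable[[str], List[Choice]]:
    grid = [(s, c) for s in range(num_slots) for c in range(num_cycles)]
    return lambda _instruction: grid


def make_validator(edges: List[Tuple[str, str]]):
    """Toy rule set: one instruction per slot; consumer cycle = producer cycle + 1."""
    def is_valid(partial: Dict[str, Choice], current: str, choice: Choice) -> bool:
        slot, cycle = choice
        if any(used_slot == slot for used_slot, _ in partial.values()):
            return False
        for producer, consumer in edges:
            if consumer == current and producer in partial:
                if cycle != partial[producer][1] + 1:
                    return False
            if producer == current and consumer in partial:
                if partial[consumer][1] != cycle + 1:
                    return False
        return True
    return is_valid


chain = [("a", "b"), ("b", "c")]
found = backtrack_schedule(["b", "a", "c"], all_choices(3, 3), make_validator(chain))
```

In the example, `b` is deliberately scheduled before its producer `a`, so the first choice for `b` (cycle 0) leaves no valid cycle for `a`; the search returns to `b`, moves it to cycle 1, and then completes. With only two slots for the three instructions, the same call returns `None`, indicating that no valid schedule exists under the toy rules.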
At least some embodiments disclosed herein include techniques of reinforcement learning to train an artificial neural network (ANN) to identify a placement of an instruction in a computing device having multiple parallel circuits for instruction execution. An example of such a computing device is a streaming engine implemented on a coarse grained reconfigurable array (CGRA) having multiple tiles. A scheduler receives an input identifying the instruction among instructions of a program, execution dependency conditions of the instructions of the program, and placements of a portion of the instructions of the program in circuit units of the computing device.
Instructions of a typical program have dependency in execution. For example, execution results generated by some instructions can be used in the program as inputs for the execution of other instructions in the program. A distribution of instructions to slots of the tiles of a coarse grained reconfigurable array (CGRA) for execution is valid when the instructions can be scheduled at proper cycles to ensure correct execution and dataflow. For example, when the instructions executed in the slots in the tiles are scheduled at certain clock cycles, inputs required for initiation of instructions in the slots of the tiles should be available in time for the execution initiation of the instructions.
In general, there are different, valid ways to schedule the instructions for execution in slots of the tiles; and the different schedules can have different performance levels in running the program in the coarse grained reconfigurable array (CGRA). For example, the performance level of a schedule can be evaluated based on the number of clock cycles required to run the program according to the schedule in the coarse grained reconfigurable array (CGRA). Although it is possible to use a brute force algorithm to test all possible schedules to find the best performing schedule, such an approach is inefficient.
In one embodiment, a reinforcement learning technique is used to train an actor model of artificial neural network (ANN) in deciding a placement of an instruction of a program in a slot of a tile in a computing device. The placement of the instruction is determined based on the placement of one or more other instructions of the program that have been scheduled/placed before the instruction. Since the placement of the instruction corresponds to selection of an option from a set of discrete options of placements, the problem can be formulated as a discrete action problem solved via artificial neural network (ANN) trained via reinforcement learning.
For example, the technique of proximal policy optimization (PPO) for reinforcement learning (RL) can be used to train a neural network model to place instructions of a program in the tiles based on a reward function. The reward function can be configured to model the coarse grained reconfigurable array (CGRA) and its constraints. To use the proximal policy optimization (PPO), samples of rewards and placement actions for the training of the actor model of artificial neural network (ANN) can be collected by running inference on the latest copy of actor model and obtaining the outcome from the reward function. The samples can be stored in a buffer for use as training data.
In one embodiment, the actor model is configured to receive, as an input, a state implemented as a concatenation of an array of placed nodes representing the placement of a portion of instructions of the program in slots of tiles of the coarse grained reconfigurable array (CGRA). The actor model further receives, as inputs, an identification of an instruction to be placed next, and a representation of a computation graph specifying the execution dependency conditions in the program. Based on the received inputs, the actor model is to generate an action indicating a tile and a slot in the tile for the placement of the instruction for execution.
The reward for the actor model can be configured based on the number of cycles taken to execute instructions. After determination of an action of placing an instruction in a slot of a tile, the corresponding inputs to the actor model, the action, and a corresponding reward for the action can be saved as a sample in the buffer. Optionally, samples in the buffer can be selectively kept or discarded (e.g., a certain number of unsuccessful placement samples) to balance the numbers of successful and unsuccessful samples for the training phase. In the training phase, proximal policy optimization (PPO) with a surrogate loss function can be used to train/adjust the actor model to produce actions from the sample states, and to train/adjust the critic model to match the sampled rewards from the actions produced by the actor model. The process of sampling and training can be repeated for improved capability of the actor model in predicting placements to maximize reward and performance.
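For illustration, one widely used surrogate in proximal policy optimization (PPO) is the clipped objective, sketched below in plain Python. The disclosure does not state which surrogate variant is used, so the clipped form, the clip range of 0.2, and the use of an advantage value (e.g., sampled reward minus the critic's estimate) are assumptions; the function name `clipped_surrogate` is hypothetical.

```python
import math
from typing import Sequence


def clipped_surrogate(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped PPO objective over a batch of sampled placement actions.

    The actor is trained to maximize this value (equivalently, to
    minimize its negative as a loss).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes the incentive for overly large updates.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The policy became 1.5x more likely to pick two sampled placements:
# one with positive advantage and one with negative advantage.
objective = clipped_surrogate(
    new_log_probs=[math.log(1.5), math.log(1.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

In the example, the gain for the favorable placement is capped at 1.2 by the clip, while the penalty for the unfavorable placement remains 1.5, so the batch objective is negative and the update is discouraged.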
In scheduling the instructions of a program, memory variables of the program to be operated upon by instructions should be implemented on the tiles in which the instructions are executed. Such a memory constraint can be captured using a memory dependency array as part of the computation graph. The actor model can include a graph neural network (GNN) receiving, as an input, the computation graph of the instructions to be performed. In the computation graph, each node represents an instruction and contains features that are a concatenation of a tile memory dependency array and a positional sinusoidal encoding. The graph neural network (GNN) model is configured to produce an embedding that is combined with the state observation and an encoding of the next instruction to be placed as a node. An attention module is added to the embedding to highlight important information to the actor model and the critic model. After placement of instructions, routing information and configurations for programming each tile can be saved as a final output.
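The node features described above can be sketched as follows. The functions `sinusoidal_encoding` and `node_features` are hypothetical names. The sketch assumes that the tile memory dependency array is a multi-hot vector over the memory variables of the program, that the position of a node is its index in some fixed ordering of the instructions, and that the sinusoidal encoding follows the form commonly used in transformer models; none of these details are specified in the disclosure.

```python
import math
from typing import List, Sequence, Set


def sinusoidal_encoding(position: int, dim: int) -> List[float]:
    """Sinusoidal positional encoding with alternating sine and cosine terms."""
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        if i + 1 < dim:
            encoding.append(math.cos(angle))
    return encoding


def node_features(
    all_variables: Sequence[str],
    used_variables: Set[str],
    position: int,
    encoding_dim: int,
) -> List[float]:
    """Concatenate the memory dependency array with the positional encoding."""
    # Assumed layout: 1.0 where the instruction uses the variable, else 0.0.
    dependency = [1.0 if var in used_variables else 0.0 for var in all_variables]
    return dependency + sinusoidal_encoding(position, encoding_dim)


variables = ["buf_a", "buf_b", "buf_c"]
first_node = node_features(variables, {"buf_b"}, position=0, encoding_dim=4)
```

For the first instruction in the ordering (position 0), which uses only `buf_b`, the resulting feature vector is the dependency array `[0.0, 1.0, 0.0]` followed by the encoding `[0.0, 1.0, 0.0, 1.0]`.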
For example, the actor model can be a feed-forward model or a sequential model such as a transformer encoder block.
The actor model trained via reinforcement learning can reduce the usability barrier of coarse grained reconfigurable array (CGRA). A user doesn't need to be an expert in coarse grained reconfigurable array (CGRA). The actor model can provide instruction placement suggestions or tile configuration labels to assist other tools or programmers in the scheduling of a program. The actor model can be used to generate an instruction execution schedule of a similar performance level faster than a brute force approach. The reinforcement learning allows unsupervised learning and optimization to search in a wider search space. Reinforcement learning can learn from a collection of programs being placed and reuse some data for scheduling new programs. Proximal policy optimization with graph embeddings can find better schedules by finding higher rewards than other approaches.
At least some embodiments disclosed herein provide a software tool that can be used to explore a design space of implementing a data flow program on a coarse grained reconfigurable array (CGRA). The design space exploration tool can be configured to search, in the design space, for a solution of instruction execution configuration that meets a set of one or more user specified selection criteria.
For example, the design space exploration tool can be configured to use a toolchain (e.g., a scheduler and a configuration generator) to convert a data flow program into numerous, valid configurations for executing the instructions of the data flow program in a coarse grained reconfigurable array. Further, the design space exploration tool can use a simulator of the coarse grained reconfigurable array to simulate the clock-by-clock execution of instructions and data propagation according to an execution configuration in the coarse grained reconfigurable array. The simulation result can show whether the execution configuration is valid and if so, provide the performance metrics of the execution configuration, such as energy consumption performance, processing speed, etc. of the data flow program implemented according to the execution configuration.
The design space exploration tool can be configured to automate the use of the simulator to simulate and validate (e.g., one after another) the numerous execution configurations generated by the toolchain. Further, the design space exploration tool can be configured to record, in a database, the performance metrics obtained from the simulator for the respective execution configurations to facilitate the selection of a configuration solution that satisfies the user specified selection criteria.
During the design space exploration, the toolchain and the simulator are exercised by the design space exploration tool in implementing the data flow program in different ways. As a result, the likelihood of errors or defects in the toolchain and the simulator being encountered during the design space exploration is greatly increased. Therefore, the validation results of the simulations confirming the validity of the execution configurations generated by the toolchain can be considered the validation of the toolchain and the simulator; and any disagreement between the toolchain and the simulator about an execution configuration can be considered an indication of a potential bug in the toolchain, the simulator, or both. Thus, the design space exploration can also be used for the thorough testing of the toolchain and the simulator at the same time.
The simulator can be configured to accurately enforce the clock cycle restrictions for various operations performed in a coarse grained reconfigurable array. When the simulator can simulate the operations specified according to an execution configuration without violating any clock cycle restrictions in the coarse grained reconfigurable array, the execution configuration is valid. From the simulation, the simulator can determine the computing performance (e.g., a number of clock cycles required to run the data flow program according to an execution configuration) and the energy performance (e.g., an amount of energy consumed in running the data flow program according to the execution configuration).
The design space exploration tool can generate data used in automated comparison of valid execution configurations produced by the toolchain from a data flow program specified in an assembly language adapted for data flow applications. For example, the design space exploration tool can run every execution configuration in the simulator to obtain simulation results. From the performance data generated from the simulations, a most desirable execution configuration can be selected based on a set of criteria specified by a user. For example, the criteria can be specified to select an execution configuration having a highest computing performance level (e.g., by completing the data flow program using a least number of clock cycles), having a highest energy efficiency level (e.g., by completing the data flow program with a least amount of energy expenditure in the coarse grained reconfigurable array), having a highest instruction packing density, having a least number of utilized computing tiles of the coarse grained reconfigurable array, having a lowest latency of the instruction flow, or selected according to a combined goal formulated based on such criteria.
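Selection according to user specified criteria can be sketched as a weighted comparison over the recorded simulation results. The record fields (`name`, `valid`, `cycles`, `energy`, `tiles`), the function `select_configuration`, and the numeric values below are hypothetical examples and not results of any actual simulation. The sketch assumes every metric is "lower is better"; a "higher is better" metric such as packing density could be given a negative weight.

```python
from typing import Dict, List, Optional

Record = Dict[str, object]


def select_configuration(
    records: List[Record], weights: Dict[str, float]
) -> Optional[Record]:
    """Pick the valid configuration with the lowest weighted cost."""
    valid = [record for record in records if record["valid"]]
    if not valid:
        return None

    def cost(record: Record) -> float:
        return sum(weight * float(record[metric]) for metric, weight in weights.items())

    return min(valid, key=cost)


# Hypothetical simulation records, as might be stored in the database.
records = [
    {"name": "cfg_a", "valid": True, "cycles": 120, "energy": 9.0, "tiles": 6},
    {"name": "cfg_b", "valid": True, "cycles": 150, "energy": 6.5, "tiles": 4},
    {"name": "cfg_c", "valid": False, "cycles": 90, "energy": 5.0, "tiles": 3},
]

fastest = select_configuration(records, {"cycles": 1.0})
most_efficient = select_configuration(records, {"energy": 1.0})
combined = select_configuration(records, {"cycles": 1.0, "energy": 10.0})
```

Here `cfg_c` is excluded because it failed validation, `cfg_a` is selected when only clock cycles matter, `cfg_b` is selected when only energy matters, and the combined goal shown (cycles plus ten times energy) selects `cfg_a`.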
The design space exploration can provide additional automated tests for the simulator by the running a potentially large number of different execution configurations. For example, a data flow program can be mapped by a toolchain to a large number (e.g., 10,000 or more) of different, valid execution configurations for running the data flow program in a coarse grained reconfigurable array; and the design space exploration tool can automate the simulation of the different execution configurations. Different execution configurations can use different capabilities in the coarse grained reconfigurable array in different ways for the implementation of the data flow program. For example, the different execution configurations can include the different combinations of the usages of delay registers of different directions, pass-through routes, asynchronous messages, and tile memory writes, etc. to implement the operations of the data flow program in the coarse grained reconfigurable array. Since the simulator is used to confirm the validity of the large numbers of combinations of capabilities found to be valid by the toolchain, the possibility of catching defects in the simulator can be significantly enhanced.
The design space exploration can also provide additional automated tests for the toolchain by using the simulator to test the validity of every execution configuration that is found to be valid by the toolchain. Different execution configurations can use various capabilities of the coarse grained reconfigurable array in different ways for the implementation of the data flow program. For example, the different execution configurations can include the different combinations of the usages of delay registers of different directions, pass-through routes, asynchronous messages, and tile memory writes, etc. to implement the operations of the data flow program in the coarse grained reconfigurable array. Since the validity of the large numbers of combinations of capabilities found to be valid by the toolchain is checked by the simulator, the possibility of catching defects in the toolchain can be significantly enhanced.
In one embodiment, a data flow program is written in a data flow assembly language. Optionally, a compiler is used to convert a data flow program written in a high-level programming language into the data flow assembly language. An assembly toolchain (e.g., a scheduler assembler or toolchain) can be configured to search for different, valid execution configurations mapped to a stream engine (e.g., implemented via a coarse grained reconfigurable array). The toolchain can include a scheduler configured to map memory variables to tiles, and a configuration generator configured to generate an instruction execution configuration for running the assembly language program on a coarse grained reconfigurable array having a hardware profile. Since there are many valid options to map memory variables and to schedule executions of instructions in the tiles of the coarse grained reconfigurable array, the number of valid execution configurations can be non-trivial (e.g., in the order of thousands or hundreds of thousands). The different execution configurations can have unique characteristics in utilization of hardware resources. It is advantageous to test the toolchain and the simulator in handling resource utilization of different characteristics.
In general, different data flow programs can have different design requirements. For example, if an application requires high performance in speed, it can be desirable to use an execution configuration that provides the highest throughput. However, if energy efficiency is the key criterion, it can be desirable to use the most energy efficient execution configuration. It is also possible that the design interest can be based on a number of performance aspects, such as in finding the most performant execution configuration among a group of execution configurations that are equally energy efficient, or in finding the most energy efficient execution configuration among a group of execution configurations that are equally performant (or meet a user specified performance criterion).
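As a non-limiting sketch of such a two-level design interest, the selection can be written as a primary criterion followed by a tie-breaking secondary criterion. The record layout (dictionaries with `cycles` and `energy` keys), the function name `select`, the `tolerance` parameter, and the sample values are assumptions made for illustration; lower values are treated as better for both metrics.

```python
def select(configs, primary, secondary, tolerance=0.0):
    """Pick the best config on `secondary` among those best on `primary`.

    Configurations whose primary metric is within `tolerance` of the best
    primary value are treated as equally good on the primary criterion.
    """
    best_primary = min(c[primary] for c in configs)
    tied = [c for c in configs if c[primary] - best_primary <= tolerance]
    return min(tied, key=lambda c: c[secondary])


# Hypothetical metrics for three execution configurations.
configs = [
    {"id": "A", "cycles": 120, "energy": 9.0},
    {"id": "B", "cycles": 100, "energy": 9.0},
    {"id": "C", "cycles": 90, "energy": 12.0},
]

# Most performant among the equally energy efficient configurations.
fastest_among_most_efficient = select(configs, "energy", "cycles")
# Most energy efficient among the fastest configurations.
most_efficient_among_fastest = select(configs, "cycles", "energy")
```

A nonzero `tolerance` stands in for a user specified performance criterion: with `tolerance=10` on `cycles`, configurations B and C are both treated as fast enough, and the more energy efficient one is chosen.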
To determine the performance metrics of the execution configurations (e.g., in speed, energy usage, tile usage, instruction packing density, latency), the design space exploration tool can be configured to automatically load each execution configuration produced by the toolchain as a part of the application test code and run it in the simulator. After the simulation of an execution configuration is complete, the design space exploration tool records the performance metrics obtained from the simulator into a database. Then, the design space exploration tool can automatically launch a next simulation of a next execution configuration produced by the toolchain. The design space exploration tool can continue to run the simulations and record the simulation results until all of the execution configurations produced by the toolchain have been simulated.
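The load, simulate, and record loop described above can be sketched as follows, for illustration only. The sketch uses an in-memory SQLite database from the Python standard library as the database; the table layout, the function names `explore` and `fake_simulation`, and the returned metrics are assumptions, and `fake_simulation` is merely a stand-in for a real simulator.

```python
import sqlite3


def explore(configurations, run_simulation, db):
    """Simulate each configuration in turn and record its metrics."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "config_id TEXT PRIMARY KEY, valid INTEGER, cycles INTEGER, energy REAL)"
    )
    for config_id, config in configurations:
        valid, cycles, energy = run_simulation(config)
        db.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
            (config_id, int(valid), cycles, energy),
        )
    db.commit()


def fake_simulation(config):
    # Stand-in for the simulator: derives made-up metrics from the config.
    return True, config["ops"] * 2, config["ops"] * 0.5


db = sqlite3.connect(":memory:")
explore([("cfg0", {"ops": 10}), ("cfg1", {"ops": 7})], fake_simulation, db)

# Rank the recorded configurations by clock cycles.
rows = db.execute(
    "SELECT config_id, cycles FROM results ORDER BY cycles"
).fetchall()
```

Keeping the metrics in a queryable store lets the selection step be expressed as a query over the recorded results rather than as a rerun of the simulations.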
The design space exploration tool can be configured to receive one or more selection criteria from a user. In response, the design space exploration tool can select, according to the performance metrics recorded in the database, an execution configuration that best meets the selection criteria.
Since different execution configurations can utilize different aspects of the stream engine (e.g., implemented via a coarse grained reconfigurable array), the design space exploration tool running the execution configurations in the simulator to obtain the performance metrics of the execution configurations also tests the simulator for errors or defects at the same time.
Similarly, since different execution configurations can utilize different aspects of the stream engine (e.g., implemented via a coarse grained reconfigurable array), the design space exploration tool running the execution configurations in the simulator also validates the conclusions made by the toolchain that the execution configurations are valid, while generating the performance metrics of the execution configurations.
Thus, the design space exploration tool fills gaps for both the toolchain and the simulator: for the toolchain, it checks the correctness of the outputs and measures the performance/power characteristics of these outputs; and for the simulator, it increases the amount of testing performed on it.
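As a non-limiting sketch, the cross-check between the toolchain and the simulator can be expressed as collecting every configuration that the toolchain reported as valid but that the simulator rejects. The function names and the toy acceptance rule (a value may not be consumed before it is ready) are assumptions made for illustration; a rejection indicates a potential defect in the toolchain, the simulator, or both, and does not by itself identify which.

```python
def simulator_rejections(valid_configs, simulator_accepts):
    """Return IDs of toolchain-valid configs that the simulator rejects."""
    return [cid for cid, cfg in valid_configs if not simulator_accepts(cfg)]


def toy_simulator_accepts(cfg):
    # Toy rule: a value cannot be consumed before the cycle it becomes ready.
    return cfg["consume_cycle"] >= cfg["ready_cycle"]


# Configurations that a hypothetical toolchain reported as valid.
reported_valid = [
    ("cfg0", {"ready_cycle": 3, "consume_cycle": 3}),
    ("cfg1", {"ready_cycle": 5, "consume_cycle": 4}),
]
suspects = simulator_rejections(reported_valid, toy_simulator_accepts)
```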
In
The dispatch interface information 111 identifies memory variables to accept arguments to be passed as input to the assembly language program 101, and data properties of the arguments. The dispatch interface information 111 can further specify the data properties of the return value of the assembly language program 101. The dispatch interface information 111 can be used to configure the dispatch interface of a coarse grained reconfigurable array (CGRA) 103 used to implement the assembly language program 101. To execute the assembly language program 101, the memory variables identified in the dispatch interface information 111 are mapped to the tile memories in the coarse grained reconfigurable array (CGRA) 103. Thus, the dispatch interface information 111 specifies the operations of the dispatch interface to store input data to memory locations represented by the memory variables.
The memory interface information 113 identifies memory access operations that are performed in the flow description 117 to access tile memories in the coarse grained reconfigurable array (CGRA) 103. The memory access operations can include operations to store data into memory variables that are used in the flow description 117, and operations to read data from memory variables that are used in the flow description 117. To execute the assembly language program 101, the memory variables identified in the memory interface information are mapped to the tile memories in the coarse grained reconfigurable array (CGRA) 103.
The tile memory information 115 identifies memory variables used in the flow description 117 and access properties of the memory variables. Such memory variables can include the memory variables identified in the dispatch interface information 111 to store arguments or inputs to the assembly language program 101, the memory variables identified in the memory interface information 113, and other memory variables that can be used in synchronous operations of data flows in the flow description 117.
The flow description 117 specifies one or more data flow graphs. Each data flow graph identifies a synchronous flow of data through memory variables mapped to tile memories and synchronous values mapped to connections between tiles; and each data flow graph further identifies the computations (e.g., add, multiplication, bitwise shift, etc.) performed on those values on the tile data path. For example, some memory variables can be identified in dispatch interface information 111, memory interface information 113, tile memory information 115 for synchronous use (e.g., FIFO) or asynchronous use (e.g., dispatch/memory interface) of tile memories and thus for mapping to tile memories in implementations; additional variables can be used in the flow description 117 that may or may not be mapped to tile memories in implementations. For example, a synchronous value used through a FIFO in the flow description 117 is mapped to a tile memory; some variables in the flow description 117 can be mapped to tile memories using a FIFO to satisfy timing requirements in scheduling instructions for execution on tiles of the coarse grained reconfigurable array (CGRA) 103; and it is not necessary to map some variables in the flow description 117 to tile memories. The data flow graph can include identification of memory access operations specified in the memory interface information 113. The memory access operations specified in the memory interface information 113 are implemented via communications over asynchronous fabric (AF) in the coarse grained reconfigurable array (CGRA) 103. In one embodiment, the flow description 117 can have multiple segments, each specifying one data flow. Each data flow can optionally include the identification of a set of asynchronous variables specified in the dispatch interface information 111, the memory interface information 113, and the tile memory information 115. 
The instructions of a data flow can start execution upon receiving messages indicating the readiness of the data identified by the set of asynchronous variables. Each data flow can be programmed to send an asynchronous message to another data flow (e.g., to start execution of a loop, to continue a flow, to send a data value, etc.). Each data flow may stop with one or more instructions outputting results into asynchronous variables specified in the dispatch interface information 111, the memory interface information 113, and the tile memory information 115. New identifications of data/variables can be used in each data flow to represent data generated within the data flow. Such new variables used within each data flow are transient, since the data represented by the variables are consumed within the data flow and discarded after the execution of the data flow. Thus, asynchronous variable/data in the program 101 refers to the data being stored into a location/variable for use at an unspecified/unknown time when the data is needed; and there is no hardware imposed limitation on the time period between data arrival and data use; in contrast, synchronous variable/data refers to the data being generated for use at a time determined by a synchronous connection in the coarse grained reconfigurable array (CGRA) 103. The instructions in a data flow may not be connected based on the sequence of the instructions written in the flow description. Some instructions are tied to each other based on the data being consumed as input and data being generated as output that may be consumed synchronously, or propagated asynchronously.
Further details about the coarse grained reconfigurable array 103, the dispatch interface information 111, the memory interface information 113, and the tile memory information 115 are provided below in connection with
In
Alternatively, the user can write the assembly language program 101 without using a compiler (e.g., 107). For example, a programming/compilation tool can be adapted to receive user inputs to specify the assembly language program 101.
For example, the assembly language program 101 of
In
A typical tile 141 includes tile memories 131, . . . , 133 having synchronous connections 135 with a computing logic 137. The computing logic 137 can be configurable to execute different instructions. For example, the computing logic 137 can include a single instruction multiple data (SIMD) unit. Upon receiving a single instruction, the single instruction multiple data (SIMD) unit can operate on multiple data items in the tile memories 131, . . . , 133. For example, the computing logic 137 can include a pipelined time-multiplexed processing unit that can start execution of a new instruction at every clock cycle. Execution of an instruction can complete after a predetermined number of clock cycles. Results of executing instructions can propagate from one tile (e.g., 141) to a neighboring tile (e.g., 143) via synchronous connections 129 in a predetermined number of clock cycles. Results of executing instructions can also be accessed through memory interfaces (e.g., 123, . . . , 125, and dispatch interface 121) via asynchronous connections 127.
The coarse grained reconfigurable array 103 has synchronous connections 129 among some pairs of the tiles 141, 143, . . . , 145. For example, the synchronous connections 129 offer a direct connection between tile 141 and tile 143, but no direct connection between tile 143 and tile 145. For example, the synchronous connections 129 can connect neighboring tiles (e.g., 141, 143) to form a chain or pipeline among the tiles 141, 143, . . . , 145.
The coarse grained reconfigurable array 103 has asynchronous connections 127 between the tiles 141, 143, . . . , 145 and memory interfaces 123, . . . , 125 and a dispatch interface 121. The dispatch interface 121 can function as a memory interface. Each memory interface (e.g., 123 or dispatch interface 121) can access the tile memories of one or more tiles through the asynchronous connections 127. Each of the tiles 141, 143, . . . , 145 can have a delay register controllable to provide output of the tile for synchronization with the timing of the execution of a next instruction that uses the output. The dispatch interface 121 can communicate inputs and outputs of the coarse grained reconfigurable array 103 from or to a circuit external to the coarse grained reconfigurable array 103.
The assembly language program 101 of
With the details of the coarse grained reconfigurable array 103, the assembly language program 101 of
The operations of the coarse grained reconfigurable array 103 can be described and/or scheduled as flows of data among tile memories (e.g., 131, . . . , 133) of tiles (e.g., 141, 143, . . . , 145) through the connections 135, 129, and 127 and the computing logic 137 at various clock cycles. Since the flow description 117 describes the required data flows for the operations of the assembly language program 101, the data flows identified by the flow description 117 can be mapped to the data flows in the coarse grained reconfigurable array 103 for execution.
For example, the dispatch interface information 111 of
The assembly language program 101 of
The storing of the input data to the memory locations represented by the memory variables 153, . . . , 163 can be implemented via the operations of the dispatch interface 121 of the coarse grained reconfigurable array 103.
The dispatch interface information 111 can further specify the return value property 159 of the assembly language program 101. For example, the return value property 159 can specify the data type and/or a data size of the value to be returned by the assembly language program 101 upon completion of execution of the assembly language program 101.
For example, the memory interface information 113 of
The memory interface information 113 identifies a plurality of memory access operations 173, . . . , 183. Each memory access operation (e.g., 173 or 183) can be an operation to store data into memory or read data from memory, where the memory location is represented by a memory variable (e.g., 175 or 185) having a corresponding memory property (e.g., 177 or 187) for the data stored or accessed at the memory location. The memory access operations (e.g., 173 or 183) can be implemented via the operations of the memory interfaces 123, . . . , 125 and/or the dispatch interface 121 of the coarse grained reconfigurable array 103.
The memory access operations (e.g., 173 or 183) are associated with access IDs (e.g., 171 or 181) in the memory interface information 113 to represent the corresponding memory access operations (e.g., 173 or 183). The flow description 117 of the assembly language program 101 can use the access IDs (e.g., 171 or 181) to specify the uses of the respective memory access operations (e.g., 173 or 183) in the data flow graphs.
A memory property (e.g., 177 or 187) can identify a data type and/or a data size of the data to be operated upon via the memory access operation (e.g., 173 or 183).
For example, the tile memory information 115 of
The tile memory information 115 identifies the properties (e.g., 157, . . . , 167, 179, . . . , 189, 193, . . . ) of the respective memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) used in the flow description 117 to identify memory locations in tiles of a coarse grained reconfigurable array 103. To execute the assembly language program 101, the memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) are mapped to tile memories (e.g., 131, . . . , 133) of tiles (e.g., 141, 143, . . . , 145) of the coarse grained reconfigurable array 103.
The properties (e.g., 157, . . . , 167, 179, . . . , 189, 193, . . . ) can identify the memory access types, sizes, etc. of the respective memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ). Examples of memory access type can include unknown, shared, first in first out (FIFO), etc.
The memory variables specified in the tile memory information 115 can include memory variables (e.g., 153, . . . , 163) identified in the dispatch interface information 111, memory variables (e.g., 175, . . . , 185) identified in the memory interface information 113, and other memory variables used in the flow description 117 to identify memory locations of data flows. The flow description 117 further identifies operations performed to transform the data along the flows.
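For illustration only, the relationship among the sections of the assembly language program can be sketched as a small data structure with a consistency check: every memory variable named in the dispatch interface information or the memory interface information is expected to have an entry in the tile memory information. The class name, field names, variable names, and the concrete layout are assumptions and do not reflect the syntax of the assembly language.

```python
from dataclasses import dataclass, field


@dataclass
class AssemblyProgram:
    dispatch_vars: dict   # argument name -> memory variable
    memory_ops: dict      # access ID -> (operation, memory variable)
    tile_memory: dict     # memory variable -> (access type, size)
    flows: list = field(default_factory=list)  # flow description, left opaque


def undeclared_variables(program):
    """List interface variables that lack a tile memory information entry."""
    used = set(program.dispatch_vars.values())
    used |= {var for _, var in program.memory_ops.values()}
    return sorted(used - set(program.tile_memory))


program = AssemblyProgram(
    dispatch_vars={"arg0": "in_a"},
    memory_ops={"ld0": ("read", "buf")},
    # "fifo0" is a variable used only inside the data flows.
    tile_memory={"in_a": ("shared", 4), "fifo0": ("fifo", 16)},
)
missing = undeclared_variables(program)
```

Here `missing` names the variable `buf`, which the memory interface section references without a corresponding tile memory entry.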
In one embodiment, a method is provided to specify operations in a coarse grained reconfigurable array. For example, the method of specifying operations can be performed by a user, a compiler, or a compilation/programming tool implemented via software and/or hardware in a computing device to generate the assembly language program 101 of
In the method of specifying operations, the user, compiler, and/or compilation/programming tool identifies dispatch interface information 111 representing operations to be performed via a dispatch interface 121 of a coarse grained reconfigurable array 103 to receive an input.
For example, the coarse grained reconfigurable array 103 can have a plurality of tiles 141, 143, . . . , 145 interconnected via synchronous connections 129 and 135 and asynchronous connections 127. Each of the tiles (e.g., 141) has tile memories (e.g., 131, . . . , 133) and a reconfigurable computing logic (e.g., 137). In response to an instruction, the computing logic 137 can be reconfigured to perform the operation of the instruction in the flow of data from one memory location to another in the coarse grained reconfigurable array 103.
For example, the dispatch interface information 111 can include identification of first memory variables 153, . . . , 163 for arguments 151, . . . , 161 respectively to indicate the operations of writing the input according to the arguments to the memory locations represented by the first memory variables 153, . . . , 163.
In the method of specifying operations, the user, compiler, and/or compilation/programming tool identifies memory interface information 113 representing operations to be performed via one or more memory interfaces of the coarse grained reconfigurable array.
For example, the memory interface information 113 can include identification of second memory variables 175, . . . , 185 associated with memory access operations 173, . . . , 183 for storing or retrieving data items to or from memory locations referred to and represented by the second memory variables 175, . . . , 185.
The memory interface information 113 and the dispatch interface information 111 can include the types and sizes of data items identified by memory variables (e.g., 153, 163, 175, 185) and operated upon in the respective memory access operations (e.g., 173, 183, or storing inputs according to the arguments 151, . . . , 161).
In the method of specifying operations, the user, compiler, and/or compilation/programming tool identifies tile memory information 115 about a set of memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) referring to memory locations to be implemented in tile memories (e.g., 131, 133) of the coarse grained reconfigurable array 103.
The tile memory information 115 can further identify access types and sizes of the set of memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) for implementation in the coarse grained reconfigurable array. The set of memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) can include the first memory variables (e.g., 153, . . . , 163) identified in the dispatch interface information 111, the second memory variables (e.g., 175, . . . , 185) identified in the memory interface information 113, and at least one third memory variable 191 referring to a memory location in one or more synchronous data flows to be implemented via the coarse grained reconfigurable array 103.
In the method of specifying operations, the user, compiler, and/or compilation/programming tool identifies one or more synchronous data flows, through memory locations referenced via the memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) in the tile memory information 115, to produce a result from the input. Data can be transformed via execution of instructions along the flows; and the data flows can go through other variables that do not have to be mapped to tile memories.
In the method of specifying operations, the user, compiler, and/or compilation/programming tool generates an assembly language program 101 containing the dispatch interface information 111, the memory interface information 113, the tile memory information 115, and a flow description 117 specifying the one or more data flows.
For example, a compiler 107 can be configured to compile a computer program 105 written in a high-level language to generate the assembly language program 101.
Alternatively, a compilation/programming tool can be configured to present a user interface to receive user inputs to identify the dispatch interface information 111, the memory interface information 113, the tile memory information 115, and the one or more data flows, etc. Based on the user inputs, the compilation/programming tool can check for errors and generate the assembly language program 101.
Optionally, a compiler and/or a compilation/programming tool can be further configured to map the one or more data flows specified in the assembly language program 101 to flows of data in the coarse grained reconfigurable array 103, including mapping the set of memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) to tile memories (e.g., 131, 133) in the coarse grained reconfigurable array 103.
For example, the instruction execution schedule 223 can be generated from the assembly language program 101 of
The assembly language program 101 of
A hardware profile 239 can identify the high level structural features of a coarse grained reconfigurable array 103 to be used to run the assembly language program 101. Such high level structural features can specify the coarse grained reconfigurable array 103 among possible implementations of a coarse grained reconfigurable array. For example, the high level structural features can specify the number of tiles (141, 143, . . . , 145), the number of memory interfaces (e.g., 123, . . . , 125), the connection topology in the synchronous connections 129 and asynchronous connections 127, numbers of clock cycle delays in the synchronous connections 129 and asynchronous connections 127, etc., in the coarse grained reconfigurable array 103.
The hardware profile 239 has sufficient details to allow a scheduler to map instructions (e.g., 233, 243) in the data flows of the assembly language program 101 into tiles 141, 143, . . . , 145 for execution at proper time instances represented by identification of cycles (e.g., 231, 241).
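As a non-limiting sketch, a high level hardware profile of this kind can be represented as a record holding the tile count, the memory interface count, the synchronous connection topology with per-link delays, and an asynchronous delay. The class `HardwareProfile`, the helper `chain_profile`, the field names, and the default delay values are hypothetical; the chain topology is only one of the possible topologies.

```python
from dataclasses import dataclass


@dataclass
class HardwareProfile:
    num_tiles: int
    num_memory_interfaces: int
    sync_links: dict      # (source tile, destination tile) -> delay in cycles
    async_delay: int      # assumed nominal delay of an asynchronous transfer

    def sync_delay(self, src, dst):
        """Delay of a direct synchronous link, or None if none exists."""
        return self.sync_links.get((src, dst))


def chain_profile(num_tiles, num_memory_interfaces=2, link_delay=1, async_delay=4):
    """Build a profile in which neighboring tiles form a bidirectional chain."""
    links = {}
    for t in range(num_tiles - 1):
        links[(t, t + 1)] = link_delay
        links[(t + 1, t)] = link_delay
    return HardwareProfile(num_tiles, num_memory_interfaces, links, async_delay)


profile = chain_profile(3)
```

A scheduler consulting such a profile can tell that tiles 0 and 1 are directly connected, while tiles 0 and 2 are not and would need a route through tile 1.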
For best performance, the scheduler 221 can map instructions into different tiles 141, 143, . . . , 145 for execution. Although it is possible to map all instructions of the assembly language program 101 to a single tile (e.g., 143 or 141) for execution, such a schedule is inefficient because it fails to utilize the resources in the remaining tiles (e.g., 145). The scheduler 221 is configured to distribute instructions to different tiles 141, 143, . . . , 145 for parallel execution for improved performance and a reduced or minimized number of clock cycles to complete the computation of the assembly language program 101.
For example, the scheduler 221 can distribute instructions of different data flows to different tiles. For example, the scheduler 221 can tentatively place a next instruction in each of several tiles and identify the placement that results in a best performance for execution up to the next instruction.
In placing the instructions (e.g., 233, . . . , 243), the scheduler 221 also identifies the clock cycle (e.g., 231, . . . , 241) for the initiation of the execution of the instructions (e.g., 233, . . . , 243) in the tiles (e.g., 141, 143, . . . , 145).
In general, the instruction execution schedule 223 can include a sequence of instruction placement for each of the tiles 141, 143, . . . , 145. For example, a typical tile 141 is assigned to execute instructions 233, . . . , 243 respectively at the clock cycles 231, . . . , 241. The scheduler 221 identifies the cycles 231, . . . , 241 such that the outputs of computations can be used in correct cycles as inputs to subsequent computations. Thus, the data can flow in and among the tiles 141, 143, . . . , 145 for synchronous operations.
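For illustration only, the placement strategy of trying each tile for the next instruction and keeping the placement with the earliest start cycle can be sketched as a greedy scheduler. The sketch assumes a simplified model that is not taken from the disclosure: instructions arrive in dependency order, every instruction has the same `latency`, tiles form a linear chain with a fixed `hop_delay` per hop, and a tile starts at most one instruction per cycle. The function name `schedule` and the instruction names are hypothetical.

```python
def schedule(instrs, num_tiles, latency=1, hop_delay=1):
    """Greedily map instructions to (tile, start cycle) pairs.

    `instrs` is a list of (name, [input names]) in dependency order.
    """
    placed = {}    # instruction name -> (tile, start cycle)
    busy = set()   # (tile, cycle) slots already used to start an instruction
    for name, inputs in instrs:
        best = None
        for tile in range(num_tiles):
            # Earliest cycle at which all inputs have reached this tile.
            earliest = 0
            for src in inputs:
                src_tile, src_cycle = placed[src]
                arrival = src_cycle + latency + hop_delay * abs(tile - src_tile)
                earliest = max(earliest, arrival)
            cycle = earliest
            while (tile, cycle) in busy:
                cycle += 1
            if best is None or cycle < best[1]:
                best = (tile, cycle)
        placed[name] = best
        busy.add(best)
    return placed


program = [("a", []), ("b", []), ("c", ["a", "b"])]
plan = schedule(program, num_tiles=2)
```

In this example the independent instructions `a` and `b` start in the same cycle on different tiles, and `c` is placed at the first cycle by which both results can reach its tile. A greedy choice per instruction is not guaranteed to minimize the total number of clock cycles; it only illustrates the evaluate-and-pick step.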
Further, the hardware profile 239 allows the scheduler 221 to map the memory variables in the assembly language program 101 into the tiles 141, 143, . . . , 145 for implementation via tile memories (e.g., 131, . . . , 133), as illustrated in
In
For example, in a typical tile 141, memory variables 153, . . . , 175 of the assembly language program 101 are mapped in the memory map 225 for implementation via tile memories 131, . . . , 133 of the tile 141. Other memory variables of the assembly language program 101 are mapped to other tiles (e.g., 143, . . . , 145).
When the data stored in a variable (e.g., 153) is mapped to a tile (e.g., 141) for implementation using its tile memory (e.g., 131 or 133), it is typically efficient to map the instructions operating on the data to the same tile (e.g., 141), since accessing the data via connections between tiles can take a longer time than accessing within the tile. Thus, the generation of the memory map 225 and the generation of the instruction execution schedule 223 can be performed together to identify a high performance schedule 223.
Certain hardware details can be excluded from the hardware profile 239 to allow the scheduler 221 to focus on the operation of instruction placement in the tiles 141, 143, . . . , 145. Thus, the scheduler 221 does not determine low level details of configuring the coarse grained reconfigurable array 103 for running the assembly language program 101 according to the schedule 223. Such low level details can include how the dispatch interface 121 and the memory interfaces 123, . . . , 125 are configured for the operations of the assembly language program 101, how the memory locations represented by the memory variables (e.g., 153, 175) are implemented via tile memories (e.g., 131, 133), how the connections (e.g., 135) in the tiles (e.g., 141) are configured to facilitate the correct data flows within the tiles (e.g., 141, 143, . . . , 145), etc. A more detailed hardware profile can be used to generate the configuration to execute the assembly language program 101, as illustrated in
In
A configuration generator 229 can use the hardware profile 249 to determine an execution configuration 227 for an assembly language program 101 having a memory map 225 and an instruction execution schedule 223.
The execution configuration 227 has detailed information on how to control and/or use the elements of the coarse grained reconfigurable array 103 to run the assembly language program 101.
For example, the memory map 225 specifies which tile (e.g., 141) of the coarse grained reconfigurable array 103 is used to implement the memory represented by a memory variable (e.g., 153). The configuration generator 229 can further determine, for the execution configuration 227, which portion of tile memories (e.g., 131) in the tile (e.g., 141) is used for the memory represented by the memory variable (e.g., 153).
For example, the instruction execution schedule 223 identifies which instruction (e.g., 233) is scheduled to be initiated for execution on which tile (e.g., 141) at which clock cycle (e.g., 231). The configuration generator 229 can further determine the connectivity control 236 for the configuration of the connections 135 in the tile (e.g., 141) to ensure proper flow of data in the tile (e.g., 141) for the execution of the instruction. For example, the connections 135 in the tile (e.g., 141) can be configured via controlling bits for multiplexers in the connections 135; and the connectivity control 236 can identify the controlling bits.
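As a non-limiting sketch of how controlling bits for multiplexers might be derived, the following packs one select value per multiplexer into a single control word. The bit layout (least significant bits first, each field just wide enough for the number of inputs of the multiplexer), the function name `pack_mux_selects`, and the example values are assumptions; the actual encoding of the connectivity control is hardware specific and is not given in the description.

```python
def pack_mux_selects(selects):
    """Pack (select value, number of mux inputs) pairs into a control word.

    Returns the control word and the total number of bits used.
    """
    word = 0
    shift = 0
    for value, num_inputs in selects:
        if not 0 <= value < num_inputs:
            raise ValueError(f"select {value} out of range for {num_inputs} inputs")
        # Bits needed to address num_inputs inputs (at least one bit).
        width = max(1, (num_inputs - 1).bit_length())
        word |= value << shift
        shift += width
    return word, shift


# Three hypothetical multiplexers with 2, 4, and 2 inputs.
control_word, bit_count = pack_mux_selects([(1, 2), (2, 4), (0, 2)])
```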
For example, the dispatch interface information 111 of the assembly language program 101 specifies how the dispatch interface 121 is to store inputs received as arguments 151, . . . , 161. After the determination of how the memory variables 153, . . . , 163 associated with the arguments 151, . . . , 161 are implemented using which tile memories (e.g., 131, . . . 133) in which tiles (e.g., 141, 143, . . . , 145), the configuration generator 229 can further determine the operation control 237 of the dispatch interface 121 to process inputs.
Similarly, after the determination of the tile memory implementations of the memory variables 175, . . . , 185 identified in the memory interface information 113 of the assembly language program 101, the configuration generator 229 can further determine the operation control (e.g., 247) of the memory interfaces (e.g., 123, . . . , 125) to process memory access operations 173, . . . , 183 identified in the flow description 117 using their access IDs 171, . . . , 181.
For example, the configuration generator 229 can trace the data flows specified in the flow description 117 of the assembly language program 101 and implemented according to the instruction execution schedule 223. When the tracing detects data flowing into a tile (e.g., 141) at a clock cycle 231, the configuration generator 229 identifies the incoming control 235 to be applied to facilitate data flowing into the tile 141; and when the tracing detects data flowing out of the tile 141 at the clock cycle 241, the configuration generator 229 identifies the outgoing control 245 to be applied to the tile 141 to facilitate data flowing out of the tile 141 (e.g., the timing control of the delay register of the tile 141).
When the coarse grained reconfigurable array 103 is controlled and/or used according to the execution configuration 227, the coarse grained reconfigurable array 103 can run instructions of the assembly language program 101 according to the instruction execution schedule 223 to implement the computation as specified in the assembly language program 101.
In one embodiment, a method is provided to identify a configuration of a coarse grained reconfigurable array to run an assembly language program. For example, the method of configuration identification can be used in a configuration generator 229 implemented as a lowering program to generate an execution configuration 227 of
In the method of configuration identification, the configuration generator 229 receives an assembly language program 101 identifying data flows through memory locations represented by memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ) and identifying instructions configured to transform data in the data flows (e.g., as specified in a flow description 117).
In the method of configuration identification, the configuration generator 229 further receives a hardware profile 249 identifying details of a coarse grained reconfigurable array 103 having a plurality of tiles 141, 143, . . . , 145 operable in parallel.
For example, the coarse grained reconfigurable array 103 can include a plurality of memory interfaces (e.g., 123). One of the memory interfaces can be configured/used as a dispatch interface 121. The coarse grained reconfigurable array 103 has the plurality of tiles 141, 143, . . . , 145 interconnected via synchronous connections 127 and asynchronous connections 129. Each of the tiles can have tile memories (e.g., 131, . . . , 133) and a reconfigurable computing logic 137.
In the method of configuration identification, the configuration generator 229 further receives an instruction execution schedule 223 identifying timing of execution of the instructions of the assembly language program 101 in the tiles 141, 143, . . . , 145.
In the method of configuration identification, the configuration generator 229 identifies memories (e.g., 131, . . . , 133) in the tiles (e.g., 141, 143, . . . , 145) configured to be used to implement the memory locations represented by the memory variables (e.g., 153, . . . , 163, 175, . . . , 185, 191, . . . ).
In the method of configuration identification, the configuration generator 229 generates an execution configuration 227 identifying operation controls (e.g., 235, 245, 236, 237, 247) to be applied in the coarse grained reconfigurable array 103 during execution of the instructions of the assembly language program 101.
For example, the assembly language program 101 includes dispatch interface information 111 representing operations to be performed to store inputs into first memory locations represented by first memory variables (153, . . . , 163). After identifying the tile memories used to implement the first memory locations, the configuration generator 229 can identify, based on the dispatch interface information 111, operating controls 237 of the dispatch interface 121 of the coarse grained reconfigurable array 103 to store the inputs to tile memories identified to implement the first memory locations.
For example, the assembly language program 101 includes memory interface information 113 representing operations to be performed to store or retrieve data at or from second memory locations represented by second memory variables (175, . . . , 185). After identifying the tile memories used to implement the second memory locations, the configuration generator 229 can identify, based on the memory interface information 113, operating controls 247 of the memory interfaces (e.g., 123 or 125) of the coarse grained reconfigurable array 103 to store or retrieve data at or from tile memories identified to implement the second memory locations.
For example, the assembly language program 101 has a flow description 117 specifying the data flows. The configuration generator 229 can trace the data flows in connection with identification of the timing of execution of the instructions to identify timing of controls (e.g., 235, 245, 236, 237, 247) to be applied in the tiles during execution of the assembly language program 101.
For example, during the tracing of the data flows, the configuration generator 229 can detect an instance of data flowing into a first tile (e.g., 141) of the coarse grained reconfigurable array 103. In response, the configuration generator 229 can identify incoming controls 235 to be applied to the first tile 141 and the timing (e.g., cycle 231) of the incoming control 235 during execution of the assembly language program 101 in the coarse grained reconfigurable array 103.
For example, during the tracing of the data flows, the configuration generator 229 can detect an instance of data flowing out of the first tile 141 of the coarse grained reconfigurable array 103. In response, the configuration generator 229 can identify outgoing controls 245 to be applied to the first tile 141 and timing (e.g., cycle 241) of the outgoing controls during the execution of the assembly language program 101 in the coarse grained reconfigurable array 103.
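As a non-limiting illustration, the tracing of data flows to identify incoming and outgoing controls and their timing can be sketched as follows. The sketch assumes that the schedule has already been reduced to a list of transfers, each with a source tile, a destination tile, a departure cycle, and an arrival cycle. The record types `Transfer` and `Control` are hypothetical.

```python
from collections import namedtuple

# Hypothetical records used only for this sketch.
Transfer = namedtuple("Transfer", "src_tile dst_tile depart_cycle arrive_cycle")
Control = namedtuple("Control", "kind tile cycle")


def trace_controls(transfers):
    """Emit one outgoing and one incoming control for each tile-to-tile transfer."""
    controls = []
    for t in transfers:
        if t.src_tile == t.dst_tile:
            # Data stays within one tile; no incoming or outgoing control is needed here.
            continue
        controls.append(Control("outgoing", t.src_tile, t.depart_cycle))
        controls.append(Control("incoming", t.dst_tile, t.arrive_cycle))
    # Order the controls by the cycle at which they are to be applied.
    return sorted(controls, key=lambda c: (c.cycle, c.tile, c.kind))
```

Transfers that remain within a tile are skipped in this sketch on the assumption that they are covered by connectivity controls of the tile rather than by incoming or outgoing controls.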
For example, during the tracing of the data flows within the tiles, the configuration generator 229 can identify connectivity controls 236 of the tiles 141, 143, . . . , 145 for data flowing within the tiles according to the instruction execution schedule 223. For example, each respective tile (e.g., 141) among the tiles has internal connections 135 between tile memories 131, . . . , 133 and a computing logic 137. After the determination of the tile memories implementing the memory variables in the assembly language program 101, the configuration generator 229 can determine the connectivity among the tile memories (e.g., 131, . . . , 133) and the computing logic 137 to facilitate the data flows within the tiles (e.g., 141). The internal connections 135 can include multiplexers to control data paths; and the connectivity controls 236 can include setting bits to control the multiplexers to implement the data flows.
After the determination of the execution configuration 227, the coarse grained reconfigurable array 103 can be controlled according to the content of the execution configuration 227 during execution of the instructions of the assembly language program 101 according to the instruction execution schedule 223. The use of the execution configuration 227 ensures the correct operation configuration for running the assembly language program 101. Different schedules (e.g., 223) of the assembly language program 101 as input to the configuration generator 229 can result in different configurations (e.g., 227).
For example, the instructions identified in the flow description 117 of the assembly language program 101 of
For example, the flow description 117 of the assembly language program 101 of
For example, the data flow 301 specifies a synchronous data flow through memory locations represented by memory variables 313. The data flow 301 further specifies instructions 311 identifying opcodes of operations to be performed upon the data flowing between the memory locations represented by memory variables 313. Similarly, the data flow 303 specifies memory variables 323 and instructions 321.
In
For example, the group 305 is configured to have instructions 233, . . . , 243 and memory variables 153, . . . , 175 from some of the data flows 301, . . . , 303. The group 307 has instructions 331 and memory variables 333 from some of the data flows 301, . . . , 303.
For example, since memory variables 153, . . . , 175 are assigned to the group 305, instructions 233, . . . , 243 operating on the data at memory locations represented by the memory variables 153, . . . , 175 can also be assigned to the group 305. However, an instruction operating on data at a memory location represented by a memory variable not in the group 305 is not assigned to the group 305. Some instructions do not operate on memory variables mapped to tile memories for implementation and thus are not restricted by memory variable implementations in their placements.
Each of the groups 305, . . . , 307 is to be implemented on a tile (e.g., 141, 143, . . . or 145) of a coarse grained reconfigurable array 103 that is used to run the program having the data flows 301, . . . , 303.
The partitioning 309 is configured to implement constraints 315. For example, instructions that share one or more tile memory variables should be placed on a same tile and thus assigned to one or more groups to be implemented on a same tile. For example, two or more instructions that each start a synchronous flow may not be placed on a same tile and thus may not be assigned to groups to be implemented on a same tile. For example, multiple sibling instructions may not be placed on a same tile and thus may not be assigned to groups to be implemented on a same tile.
Optionally, the constraints 315 can include a requirement to balance the instruction groups 305, . . . , 307 to have a similar number of instructions per group.
Optionally, the constraints 315 can include a requirement to balance memory usages of the instruction groups 305, . . . , 307 to have a similar total tile memory utilization per group.
Within each instruction group (e.g., 305), tile memory variables (e.g., 153, . . . , 175) are selected such that they can be distributed to tile memory region(s) without exceeding tile memory region size, without creating a tile memory access conflict for any instruction, etc.
Further, the constraints 315 can include hardware considerations to improve performance (e.g., placing neighbors in data flow on groups to be implemented on tiles close to each other).
As a result of the partitioning 309, a memory map 225 is generated to map memory variables (e.g., 153, 175, . . . , 333) to tiles (e.g., 141, 143, . . . , 145) of the coarse grained reconfigurable array 103. The memory map 225 reduces the choices to be explored by a scheduler in scheduling instructions using a brute force approach.
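As a non-limiting illustration, a partitioning that keeps instructions sharing memory variables in a same group and produces a memory map can be sketched as follows. The sketch assumes only two of the constraints discussed above (shared variables stay together, and the number of instructions per tile is balanced greedily); the other constraints are omitted. The function name `build_memory_map` and the input format are hypothetical.

```python
def build_memory_map(instr_vars, tiles):
    """Group instructions that share memory variables and assign groups to tiles.

    instr_vars: instruction name -> set of memory variable names it operates on.
    tiles:      list of tile identifiers.
    Returns (memory_map, tile_of): variable -> tile and instruction -> tile.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Union each instruction with every memory variable it uses.
    for instr, variables in instr_vars.items():
        find(("instr", instr))
        for var in variables:
            parent[find(("var", var))] = find(("instr", instr))

    # Collect the connected components as groups.
    groups = {}
    for kind, name in list(parent):
        root = find((kind, name))
        groups.setdefault(root, {"instr": [], "var": []})[kind].append(name)

    # Greedy balancing: largest group first, onto the least loaded tile.
    load = {tile: 0 for tile in tiles}
    memory_map, tile_of = {}, {}
    for group in sorted(groups.values(), key=lambda g: -len(g["instr"])):
        tile = min(tiles, key=lambda t: load[t])
        load[tile] += len(group["instr"])
        for var in group["var"]:
            memory_map[var] = tile
        for instr in group["instr"]:
            tile_of[instr] = tile
    return memory_map, tile_of
```

Because every memory variable of an instruction lands on the tile of that instruction in this sketch, a scheduler consuming the resulting map only needs to choose a slot and a clock cycle within a known tile.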
Each of different sets of instruction groups represents a different portion of a search space of possible choices for scheduling the program 101. Thus, each different set can be used by a scheduler 221 to search for a valid schedule for running the instructions of the program in the coarse grained reconfigurable array 103. Parallel searches can be performed based on the different sets of instruction groups respectively. After finding multiple valid schedules using different sets of instruction groups, a best performing schedule can be selected for running the program 101.
In
For example, at clock cycles 361, 363, . . . , 365, the tile 141 performs initiation 341 of instructions in the slots 351, 353, . . . , 355 respectively. At clock cycle 367, the tile 141 performs parallel executions 343 for the instructions in the slots 351, 353, . . . , 355 respectively in different pipeline stages of the computing logic 137 of the tile 141. At clock cycle 369, the completion 345 of execution is reached for the instruction in the slot 351, while the tile 141 performs parallel executions 343 for the instructions in the slots 353, . . . , 355 respectively. After clock cycle 369, the instruction slot 351 can accept another instruction.
Thus, an instruction assigned to the tile 141 has multiple choices for its execution. For example, the instruction can be placed in one of the slots 351, 353, . . . , 355 and scheduled for execution at a permissible clock cycle. For example, if an instruction is scheduled in the slot 351 for execution at cycle 361, a next instruction can be scheduled in the slot 353 for execution at or after cycle 363, or in the slot 355 at or after cycle 365, or in the slot 351 after cycle 369.
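As a non-limiting illustration, the permissible clock cycles for a next instruction can be sketched as follows. The sketch assumes that each slot has a first cycle at which it can initiate an instruction, and that a slot occupied by an instruction becomes available on the cycle after that instruction completes. The small cycle numbers and the slot names are hypothetical stand-ins and are not the reference numerals used above.

```python
def earliest_starts(first_cycle, busy_until):
    """Earliest cycle at which each slot can initiate a new instruction.

    first_cycle: slot -> first cycle at which the slot can initiate (staggered per slot).
    busy_until:  slot -> cycle at which the instruction occupying the slot completes.
    """
    return {
        slot: max(start, busy_until.get(slot, start - 1) + 1)
        for slot, start in first_cycle.items()
    }


# Three slots initiating at cycles 0, 1 and 2; slot_a holds an instruction
# that completes at cycle 4, so slot_a is free again from cycle 5.
options = earliest_starts(
    {"slot_a": 0, "slot_b": 1, "slot_c": 2},
    {"slot_a": 4},
)
```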
For a selected option of scheduling an instruction in a slot (e.g., 351, 353, . . . , or 355) for execution at a clock cycle (e.g., 361, 363, . . . , 365, 367, . . . , or 369), the scheduler 221 can determine whether the timing requirements of connecting outputs to inputs are satisfied at least for the combination of the current instruction and the previous instructions having been scheduled before the current instruction. After finding a valid option of the current instruction, the scheduler 221 can select a next instruction for scheduling, as in
In
If it is determined 373 that there is a next instruction to be scheduled, the scheduler 221 can proceed with scheduling the instruction being selected. Otherwise, the scheduler 221 completes 387 the generation of the instruction execution schedule 223.
After the selection of an instruction for scheduling, the scheduler 221 can determine 375 an available location for placement of the instruction.
For example, based on the memory map 225 of mapping the memory variables used by the instruction (e.g., as determined through partitioning 309 as in
As illustrated
If it is determined 377 that an available location (e.g., slot) is found, the scheduler 221 can further determine 379 a valid schedule for the instruction.
For example, if the instruction is to be placed in an instruction slot, the scheduler 221 can search for a clock cycle that meets the timing requirement of the tile 141 receiving inputs for the instruction from memory locations that may receive outputs from execution of other instructions.
If the scheduler 221 determines 381 that a valid schedule is found for the instruction, the scheduler 221 can further select 371 a next instruction for scheduling.
However, if the scheduler 221 determines 381 that no valid schedule is found for the current instruction being scheduled, the scheduler 221 can determine 375 an alternative location for the scheduling of the current instruction. If no suitable location can be found for the instruction, the scheduler 221 can identify 383 the schedule of the previous instruction as invalid. Thus, the previous instruction is selected to be rescheduled. The scheduler 221 moves 385 to the previous instruction as the instruction for scheduling.
In determining 375 an available location and determining 379 a valid schedule, the scheduler 221 excludes options that have been previously identified 383 as invalid and/or having been previously evaluated.
The scheduler 221 can continue the loops as shown in
In one embodiment, a method is provided to schedule instructions of an assembly language program for execution on a coarse grained reconfigurable array. For example, the method of scheduling instructions can be configured in the scheduler 221 in
In the method of scheduling instructions, the scheduler 221 receives an assembly language program 101 identifying data flows 301, . . . , 303 through memory locations represented by memory variables 313, . . . , 323 and identifying instructions 311, . . . , 321 configured to transform data in the data flows 301, . . . , 303.
In the method of scheduling instructions, the scheduler 221 receives a hardware profile 239 identifying features of a coarse grained reconfigurable array 103 having a plurality of tiles 141, 143, . . . , 145 operable in parallel.
In the method of scheduling instructions, the scheduler 221 generates a memory map 225 identifying, for each respective memory variable (e.g., 153 or 175) in the assembly language program 101, one of the tiles (e.g., 141) that contains a memory location represented by the respective memory variable (e.g., 153 or 175).
For example, the scheduler 221 can be configured to generate the memory map 225 by partitioning 309 the memory variables 313, . . . , 323 and the instructions 311, . . . , 321 of the program 101 into a plurality of groups 305, . . . , 307. Each of the groups 305, . . . , 307 is configured to be implemented on one of the tiles 141, 143, . . . , 145. The groups 305, . . . , 307 as partitioned for implementation on the tiles 141, 143, . . . , 145 meet constraints and/or requirements.
For example, instructions that share one or more tile memory variables are placed in one or more groups to be implemented on a same tile.
For example, two or more instructions that each start a synchronous flow are not placed in groups to be implemented on a same tile.
For example, multiple sibling instructions are not placed in groups to be implemented on a same tile, etc.
For example, the partitioning 309 can be performed to balance a number of instructions implemented per tile, to balance a number of memory variables implemented per tile, and/or to balance an amount of memory usage implemented per tile, etc.
In the method of scheduling instructions, the scheduler 221 assigns, based on the memory map 225, the instructions 311, . . . , 321 to the tiles 141, 143, . . . , 145 for execution.
For example, each respective instruction (e.g., 233) among the instructions 311, . . . , 321 is assigned to a tile containing memory variables (e.g., 153, 175) having data to be operated upon by the respective instruction (e.g., 233).
In the method of scheduling instructions, the scheduler 221 provides, as an output, an instruction execution schedule 223 identifying timing of execution of the instructions 311, . . . , 321 in the tiles.
For example, the timing of execution of the instructions 311, . . . , 321 can be determined by: selecting 371 a current instruction (e.g., 233) for scheduling; determining 375 an available location via identifying a slot (e.g., 353) in a first tile (e.g., 141) containing memory variables (e.g., 153, 175) used by the current instruction (e.g., 233); and determining 379 a valid schedule for the current instruction (233) via searching for a clock cycle (e.g., 231) for execution of the current instruction (e.g., 233) in the slot (e.g., 353).
In response to a determination 381 that a valid spoke RAM slot (e.g., for execution at a valid clock cycle 231) is found for execution of the current instruction, the scheduler 221 can select 371 a next instruction for scheduling.
In response to a determination 381 that no valid spoke RAM slot is found for execution of the current instruction (e.g., 233) in the slot (e.g., 353), the scheduler 221 is configured to search for an available slot in the first tile. In response to such an available slot being found, the scheduler 221 can search for a clock cycle for execution of the current instruction in the available slot.
However, in response to a determination 375 that no available slot is found, the scheduler 221 is configured to: determine a prior instruction scheduled before the current instruction; identify 383 a first schedule previously determined for the prior instruction as invalid; and start searching for a second valid schedule for the prior instruction.
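As a non-limiting illustration, the search with backtracking described above can be sketched as follows. The sketch assumes a simplified timing model in which every instruction occupies its slot for a fixed number of cycles, a consumer cannot start before its producers have finished, and candidate cycles are limited to a finite horizon. The function name, the input format, and the timing model are hypothetical.

```python
def backtracking_schedule(instrs, deps, slots, latency, horizon):
    """Return {name: (tile, slot, cycle)}, or None if no valid schedule exists.

    instrs:  list of (name, tile) pairs in scheduling order (producers first).
    deps:    name -> names of instructions whose outputs it consumes.
    slots:   tile -> list of slot ids in that tile.
    latency: cycles an instruction occupies its slot before its output is ready.
    horizon: number of candidate clock cycles (0 .. horizon - 1) to search.
    """
    options = {n: [(s, c) for s in slots[t] for c in range(horizon)] for n, t in instrs}
    chosen = {}
    cursor = [0] * len(instrs)  # next untried option for each instruction

    def valid(name, tile, slot, cycle):
        # Inputs must be available: every producer has finished.
        for producer in deps.get(name, ()):
            if cycle < chosen[producer][2] + latency:
                return False
        # The slot must not be occupied by another scheduled instruction.
        for other_tile, other_slot, other_cycle in chosen.values():
            if (other_tile, other_slot) == (tile, slot) and abs(cycle - other_cycle) < latency:
                return False
        return True

    i = 0
    while 0 <= i < len(instrs):
        name, tile = instrs[i]
        placed = False
        while cursor[i] < len(options[name]):
            slot, cycle = options[name][cursor[i]]
            cursor[i] += 1  # options already tried are never revisited
            if valid(name, tile, slot, cycle):
                chosen[name] = (tile, slot, cycle)
                placed = True
                break
        if placed:
            i += 1
            if i < len(instrs):
                cursor[i] = 0  # a new context: retry all options of the next instruction
        else:
            # No option left: the schedule of the previous instruction is invalid.
            i -= 1
            if i >= 0:
                del chosen[instrs[i][0]]
    return chosen if i == len(instrs) else None
```

The per-instruction cursor plays the role of excluding options that have been previously evaluated, so that moving back to a prior instruction resumes from its next untried option.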
Different memory maps (e.g., 225) can be generated for the same assembly language program 101; and some operations (e.g., in assigning instructions to tiles and identifying timing of the instructions) can be performed in parallel based on different memory maps to generate different instruction execution schedules (e.g., 223). The scheduler 221 can then evaluate the performance levels of the instruction execution schedules (e.g., 223) (e.g., based on latency, energy consumption, etc.). The best performing schedule can be selected for running the assembly language program 101 on the coarse grained reconfigurable array 103.
In
For example, instructions 311, . . . , 321 of an assembly language program 101 of
For the given program 101 having instructions 311, . . . , 321, the scheduler 221 can generate a computation graph to represent the execution dependency conditions 415 in the program 101. For example, the execution dependency conditions 415 can include the dependency of outputs generated by some instructions as inputs to other instructions. For example, the execution dependency conditions 415 can include memory dependency of instructions implemented on a tile depending on memory variables being implemented in the same tile.
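As a non-limiting illustration, a computation graph of output-to-input dependencies can be sketched as follows. The sketch assumes that each instruction is described by the memory variables it reads and the memory variables it writes, and it uses the standard library `graphlib.TopologicalSorter` (Python 3.9 or later) to derive an order in which producers precede consumers. The function name and the input format are hypothetical.

```python
from graphlib import TopologicalSorter


def dependency_graph(instrs):
    """Build a graph mapping each instruction to the instructions it depends on.

    instrs: name -> (input_vars, output_vars), both iterables of variable names.
    """
    writers = {}
    for name, (_, outputs) in instrs.items():
        for var in outputs:
            writers.setdefault(var, set()).add(name)
    return {
        name: {w for var in inputs for w in writers.get(var, ())} - {name}
        for name, (inputs, _) in instrs.items()
    }


program = {
    "add": (["x", "y"], ["s"]),
    "mul": (["s", "z"], ["p"]),
    "store": (["p"], []),
}
graph = dependency_graph(program)
order = list(TopologicalSorter(graph).static_order())
```

An order of this kind is one possible basis for selecting which instruction to schedule next, since every instruction then appears after the instructions producing its inputs.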
For execution of the instructions 311, . . . , 321 on the coarse grained reconfigurable array 103, the scheduler 221 is configured to determine placements of the instructions 311, . . . , 321 in the tiles 141, 143, . . . , 145 and/or in the instruction slots (e.g., 351, 353, . . . , 355) of the tiles of the coarse grained reconfigurable array 103.
The scheduler 221 can be configured to identify the placement 419 of one instruction (e.g., 405) at a time in view of prior instructions 401 that have been placed before the next instruction 405.
In
Data representing the schedule 413 of the scheduled instructions 401, the execution dependency conditions 415, and the next instruction 405 can be provided as the input 411 to the artificial neural network 417 to generate a placement 419 of the next instruction 405. The placement 419 can include a tile ID 421 and a slot ID 423 identifying the tile (e.g., 141) and the slot (e.g., 351 or 353) in the tile (e.g., 141) for the execution of the instruction 405 being placed next.
After the determination of the placement 419 for the next instruction 405, the next instruction 405 can be added in the group of scheduled instructions 401 with an updated schedule 413. A further instruction can be selected from the remaining instructions 403; and the artificial neural network 417 can be used again to generate the placement for the further instruction. The operations can be repeated until no instructions remain to be scheduled.
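As a non-limiting illustration, the repeated placement of one instruction at a time can be sketched as follows. The sketch assumes that the artificial neural network is represented by a callable that receives the input and returns a tile ID and a slot ID. The round-robin policy is a hypothetical stand-in for a trained network and is included only so that the sketch can run.

```python
def place_all(order, dependencies, policy):
    """Place instructions one at a time, feeding the growing schedule back as input."""
    schedule = {}
    remaining = list(order)
    while remaining:
        next_instr = remaining.pop(0)
        model_input = {
            "schedule": dict(schedule),    # placements of instructions scheduled so far
            "dependencies": dependencies,  # execution dependency conditions
            "next": next_instr,            # instruction to be placed next
        }
        tile_id, slot_id = policy(model_input)
        schedule[next_instr] = (tile_id, slot_id)
    return schedule


def round_robin_policy(tiles, slots):
    """Hypothetical stand-in for a trained network: cycle through tiles, then slots."""
    def policy(model_input):
        n = len(model_input["schedule"])
        return tiles[n % len(tiles)], slots[(n // len(tiles)) % len(slots)]
    return policy
```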
Optionally, a mask can be applied to the output of the scheduler 221 to filter out invalid placements. This ensures production of valid node placements. For example, the placement 419 is chosen from possibilities that are limited to placements that adhere to the constraints of the streaming engine depending on the properties of the instruction 405 to place in input 411.
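As a non-limiting illustration, masking of invalid placements can be sketched as follows. The sketch assumes that the network produces one score per candidate placement and that the scores are converted to probabilities with a softmax computed over the valid candidates only. The function name and the score format are hypothetical.

```python
import math


def masked_placement_probs(scores, valid):
    """Return placement probabilities with invalid placements forced to zero.

    scores: {(tile_id, slot_id): raw score from the network}.
    valid:  set of placements permitted by the constraints.
    """
    kept = {p: s for p, s in scores.items() if p in valid}
    if not kept:
        raise ValueError("no valid placement available")
    top = max(kept.values())  # subtract the maximum for numerical stability
    weights = {p: math.exp(s - top) for p, s in kept.items()}
    total = sum(weights.values())
    return {p: weights.get(p, 0.0) / total for p in scores}
```

In this sketch, an invalid placement receives zero probability even when its raw score is the highest, so that any placement sampled or selected from the result is a valid placement.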
In general, the scheduler 221 having the artificial neural network 417 can be used in connection with other types of scheduler 221.
For example, some of the instructions (e.g., 405) can be selected for placement by the scheduler 221 having the artificial neural network 417; and other instructions (e.g., 401) can be scheduled by another scheduler 221.
The artificial neural network 417 is trained via reinforcement learning as an actor to determine an action of the placement 419 in response to the input 411, as illustrated in
In
A performance evaluator 409 is configured to determine the cycle count 425 of executing scheduled instructions 401 and the next instruction 405 according to the schedule 413 and the test placement 418. The cycle count 425 represents the latency of producing the output of the scheduled instructions 401 and the next instruction 405 executed according to the schedule 413 and the test placement 418. Thus, the performance 435 of selecting the test placement 418 in response to the input 411 can be ranked/scored based on the cycle count 425.
The sample 431 includes the input 411, a test output 433 having the test placement 418, and the performance 435 of producing the test output 433 based on the input 411. The performance 435 can be used to represent a reward for the artificial neural network 417 making the selection of the test placement 418.
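As a non-limiting illustration, the generation of a sample can be sketched as follows. The sketch assumes that the performance evaluator is a callable returning a cycle count and that the performance is taken to be the negative of the cycle count, so that a lower latency yields a higher reward. The `Sample` record, the function name, and this particular scoring are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    model_input: dict      # schedule so far, dependency conditions, next instruction
    test_placement: tuple  # (tile_id, slot_id) being evaluated
    performance: float     # reward for selecting the test placement


def make_sample(schedule, dependencies, next_instr, test_placement, evaluate_cycles):
    """Build one training sample for a test placement of the next instruction."""
    model_input = {
        "schedule": dict(schedule),
        "dependencies": dependencies,
        "next": next_instr,
    }
    cycles = evaluate_cycles(schedule, next_instr, test_placement)
    # Assumed scoring: fewer cycles means better performance.
    return Sample(model_input, test_placement, performance=-float(cycles))
```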
Different test placements (e.g., 418) can lead to different performances (e.g., 435). Reinforcement learning (e.g., using proximal policy optimization (PPO)) can be used to train the artificial neural network 417 to improve its capability in selecting high performance placements, as in
Optionally, some or all of the test placements (e.g., 418) can be selected or generated using the current version of the artificial neural network 417 of the scheduler 221, before further training of the artificial neural network 417. Optionally, some of the test placements (e.g., 418) can be selected using another scheduler 221 (e.g., using the approach of
After a sample 431 is generated for placing the next instruction 405 after the generation of a schedule 413 for the scheduled instructions 401, a next sample can be generated for the placement of a further instruction selected from the remaining instructions 403 in view of the combined placements of the scheduled instructions 401 and the next instruction 405. Such operations can be repeated to generate samples of placing a next instruction (e.g., 405) in view of different numbers of scheduled instructions (e.g., 401) of the program 101, including cases where the next instruction (e.g., 405) is the last instruction to be placed/scheduled.
In
The critic 443 is configured to predict performances (e.g., 435) of placements selected by the scheduler 221 in response to inputs (e.g., 411). The artificial neural network 447 of the critic 443 is adjusted during the training to match the predicted performances generated by the artificial neural network 447 of the critic 443 with the corresponding performances (e.g., 435) specified in the samples 431 and generated by the performance evaluator 409.
The trained critic 443 is used to guide the scheduler 221 in making placements for improved/maximized performance.
Optionally, the collection of samples 431 used in the reinforcement learning 441 can be trimmed/selected to balance a portion of samples that can reach a final solution of scheduling all instructions of the program 101 and another portion of the samples that cannot reach a final solution.
In one embodiment, the artificial neural network 417 of the scheduler 221 and the artificial neural network 447 of the critic 443 are trained according to the samples 431 to minimize cost according to a cost function.
For example, the cost function can be constructed to evaluate a cost based on a loss associated with the action of placement generated by the artificial neural network 417 of the scheduler 221 and a loss associated with the reward evaluated by the artificial neural network 447 of the critic 443.
The loss associated with the action of placement generated by the artificial neural network 417 of the scheduler 221 can be evaluated based on selecting a smaller one from loss candidates evaluated based on an advantage weighted by a ratio. For example, the ratio can be the exponential function of a logarithm function of a probability ratio that is equal to the probability of the action from training divided by the probability of the action in samples; and the advantage can be the difference between the reward/performance from samples and the corresponding reward/performance predicted by the critic 443. The loss associated with the reward evaluated by the artificial neural network 447 of the critic 443 can be based on mean square error between reward/performance from samples and the corresponding reward/performance predicted by the critic 443.
The total loss to be minimized in the reinforcement learning 441 can be based on a combination of the loss resulting from the artificial neural network 447 of the critic 443 predicting reward/performance different from the samples 431 and the loss resulting from a decrease in predicted reward/performance caused by the artificial neural network 417 of the scheduler 221 selecting actions of placement different from the samples 431.
Through adjusting the artificial neural network 447 of the critic 443 and the artificial neural network 417 of the scheduler 221 to minimize the total loss, the artificial neural network 447 of the critic 443 is trained to predict reward/performance according to the samples 431; and the artificial neural network 417 of the scheduler 221 is trained to select placements that maximize reward/performance at the same time.
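As a non-limiting illustration, the combination of the two losses can be sketched numerically as follows. The sketch assumes the commonly used clipped surrogate form of proximal policy optimization, in which the smaller of an unclipped and a clipped ratio-weighted advantage is selected; the clipping range `clip` and the weight `value_coef` are assumed hyperparameters not specified above. The sketch only evaluates the losses and does not perform gradient updates.

```python
import math


def ppo_losses(new_logp, old_logp, rewards, values, clip=0.2, value_coef=0.5):
    """Return (actor_loss, critic_loss, total_loss) averaged over the samples.

    new_logp: log probability of each sampled placement under the network being trained.
    old_logp: log probability of the same placement recorded in the samples.
    rewards:  reward/performance recorded in the samples.
    values:   reward/performance predicted by the critic.
    """
    n = len(rewards)
    actor, critic = 0.0, 0.0
    for lp_new, lp_old, reward, value in zip(new_logp, old_logp, rewards, values):
        ratio = math.exp(lp_new - lp_old)  # probability ratio via exp of a log difference
        advantage = reward - value
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Select the smaller candidate; negate because the loss is minimized.
        actor += -min(ratio * advantage, clipped * advantage)
        critic += (reward - value) ** 2  # squared error of the critic prediction
    actor, critic = actor / n, critic / n
    return actor, critic, actor + value_coef * critic
```

For example, with a probability ratio of 2, an advantage of 0.5, and the assumed clipping range of 0.2, the clipped candidate 0.6 is smaller than the unclipped candidate 1.0 and is therefore the one selected.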
The reinforcement learning 441 and the artificial neural network 417 can be used with other scheduling techniques to generate an instruction execution schedule 223 of an assembly language program 101.
For example, the brute force search technique of
Instead of searching for a schedule for the next instruction as in
Further, when the scheduler 221 uses the brute force search technique of
In one embodiment, a method is provided to place instructions in circuit units of a coarse grained reconfigurable array. For example, the method of instruction placement can be performed by a scheduler 221 implemented via software and/or hardware in a computing device to determine a placement of an instruction among a plurality of possible placements using a reinforcement learning technique as described with
In the method of instruction placement, the scheduler 221 receives first data representative of execution dependency conditions 415 of instructions 401 and 405 of a program 101.
For example, the program 101 can be an assembly language program 101 having a flow description 117 identifying data flows 301, . . . , 303 through memory locations represented by memory variables 313, . . . , 323 and identifying the instructions 311, . . . , 321 configured to transform data in the data flows 301, . . . , 303.
For example, the first data can include data identifying dependency of execution of first instructions in receiving, as input, outputs generated from execution of second instructions.
For example, the first data can include further data identifying dependency of third instructions, scheduled to be executed in a respective tile, in accessing data at memory locations represented by memory variables implemented in the same respective tile.
In the method of instruction placement, the scheduler 221 further receives second data representative of a schedule 413 of a first portion of the instructions (e.g., 401) of the program for execution in a device having a plurality of circuit units operable in parallel.
For example, the device can include a coarse grained reconfigurable array 103 having a plurality of tiles 141, 143, . . . , 145 operable in parallel as the plurality of circuit units respectively. Each of the tiles (e.g., 141) can have a plurality of instruction slots 351, 353, . . . , 355 for pipelined execution. The schedule 413 can have a placement for each respective instruction among the instructions 401; and the placement for the respective instruction can include identification of a tile (e.g., 141) among the tiles 141, 143, . . . , 145, and a slot (e.g., 351 or 353) among instruction slots in the tile (e.g., 141) for execution of the respective instruction. Although the reinforcement learning (RL) techniques are discussed in connection with a scheduler 221 for a coarse grained reconfigurable array (103), the reinforcement learning (RL) techniques for placement can also be used to schedule chip placement tasks.
In the method of instruction placement, the scheduler 221 further receives third data identifying a next instruction 405 selected from a second portion of the instructions (e.g., 403) of the program 101 remaining to be scheduled for execution in the device.
For example, the next instruction 405 can be selected via a random selection, an incremental selection, or a topological ordering based selection from the remaining instructions 403 to be scheduled.
In the method of instruction placement, the scheduler 221 applies the first data, the second data and the third data as input 411 to a first artificial neural network 417.
In the method of instruction placement, the scheduler 221 selects, using the first artificial neural network 417, a placement 419 of the next instruction 405 in one of the circuit units from a plurality of possible placements of the next instruction 405 in the device.
For example, the placement 419 can include a tile ID 421 and a slot ID 423 indicating that the next instruction 405 is scheduled for execution in a slot (e.g., 351 or 353) represented by the slot ID 423 in a tile (e.g., 141) represented by the tile ID 421.
The scheduler 221 can have a second artificial neural network 447 trained to generate, in response to an input 411, a performance measure of the first artificial neural network 417 selecting the placement 419 of the next instruction 405 from the plurality of possible placements.
To train the first artificial neural network 417 as an actor and the second artificial neural network 447 as a critic via reinforcement learning 441, a plurality of samples 431 can be generated. Each respective sample among the samples 431 is generated to include/specify: a respective input (e.g., 411) to the first artificial neural network 417, a respective placement 418 of the respective instruction as a possible output of the artificial neural network 417, and a respective performance measure (e.g., performance 435) for the respective placement 418 as a reward for the actor to make the action of selecting the respective placement 418. The respective input (e.g., 411) can identify a respective schedule (e.g., 413) of a respective portion of the instructions (e.g., 401) of the program 101. The respective input (e.g., 411) can identify a respective instruction to be scheduled in addition to the scheduling of instructions 401 according to the respective schedule 413. The respective performance measure (e.g., performance 435) can be determined based on a cycle count 425 of executing the scheduled instructions (e.g., 401) and the respective instruction (e.g., 405) according to the respective schedule (e.g., 413) and the respective placement (e.g., 418).
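A sample of the kind described above could be represented as in the following sketch. The `Sample` type, the `make_sample` helper, and the `count_cycles` callback are hypothetical; using the negated cycle count as the reward is only one possible mapping from the cycle count 425 to a performance measure and is assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    schedule: tuple        # ((instruction, tile_id, slot_id), ...) already scheduled
    next_instruction: str  # the instruction being placed
    placement: tuple       # (tile_id, slot_id) tested for that instruction
    reward: float          # performance measure used as the reward


def reward_from_cycles(cycle_count):
    # Assumption: fewer cycles is better, so the reward is the negated count.
    return -float(cycle_count)


def make_sample(schedule, next_instruction, placement, count_cycles):
    """Build one training sample.

    count_cycles is a caller-supplied function (e.g., a simulator wrapper)
    that returns the cycle count for a complete list of placements.
    """
    extended = schedule + ((next_instruction, *placement),)
    return Sample(schedule, next_instruction, placement,
                  reward_from_cycles(count_cycles(extended)))
```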
The samples 431 and a technique of proximal policy optimization (PPO) of reinforcement learning to minimize a loss function can be used to train the first artificial neural network 417 and the second artificial neural network 447.
For example, the loss function is based on evaluating a first loss representing a reduction in performance measure resulting from the first artificial neural network selecting placements different from corresponding placements in the samples, and a second loss resulting from the second artificial neural network generating performance measures different from corresponding performance measures in the samples.
For example, the first loss (e.g., actor loss) can be based on a difference between a performance measure generated by the second artificial neural network responsive to an input specified in the samples and a corresponding performance measure specified in the samples, where the difference is weighted according to an exponential function of a logarithm function of a probability ratio that is equal to a ratio between: a probability of placements selected by the first artificial neural network responsive to inputs specified in the samples; and a probability of corresponding placements specified in the samples.
For example, the second loss (e.g., critic loss) can be based on a mean square error between performance measures generated by the second artificial neural network responsive to inputs specified in the samples and corresponding performance measures specified in the samples.
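The two losses described above can be sketched in plain Python as follows. This is a simplified reading of the text, not a definitive formulation: the function names, the sign convention (sample measure minus critic estimate), and the `critic_weight` parameter are assumptions. Implementations of proximal policy optimization commonly also clip the probability ratio; that step is not described in this passage and is omitted.

```python
import math


def actor_loss(new_log_probs, old_log_probs, values, rewards):
    """Probability-ratio-weighted difference between sample measures and critic estimates."""
    total = 0.0
    for lp_new, lp_old, value, reward in zip(new_log_probs, old_log_probs, values, rewards):
        ratio = math.exp(lp_new - lp_old)  # exp of a log ratio = probability ratio
        advantage = reward - value         # assumed sign convention
        total += ratio * advantage
    # Negated so that minimizing the loss favors placements with higher measures.
    return -total / len(rewards)


def critic_loss(values, rewards):
    """Mean square error between critic estimates and sample measures."""
    return sum((v - r) ** 2 for v, r in zip(values, rewards)) / len(rewards)


def total_loss(new_log_probs, old_log_probs, values, rewards, critic_weight=0.5):
    # critic_weight is a hypothetical tuning parameter.
    return (actor_loss(new_log_probs, old_log_probs, values, rewards)
            + critic_weight * critic_loss(values, rewards))
```

When the current policy assigns the same probabilities as the policy that produced the samples, every ratio equals one and the actor loss reduces to the negated mean difference.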
The generation of the samples 431 and the application of the reinforcement learning 441 can be performed in iterations.
For example, in searching for a valid combination of placements of instructions of different portions of the program 101, different placement options can be tested. The performances of the tested placements can be evaluated to generate the plurality of samples 431 for the training of the first artificial neural network 417 and the second artificial neural network 447. Some of the tested placements can be selected using the first artificial neural network 417 as previously trained. The further training performed using the samples 431 can improve the first artificial neural network 417 in making selections.
In
For example, an assembly language program 101 can be specified using the techniques of assembly language adapted for data flow and discussed in connection with
For example, the toolchain 201 can include a scheduler 221 and a configuration generator 229 configured to generate an execution configuration 227 as discussed in connection with
For example, the toolchain 201 (e.g., scheduler 221) can be configured to schedule instructions for execution on tiles using the techniques of
For a given assembly language program 101 that specifies an application of data flow, the toolchain 201 can generate, for a coarse grained reconfigurable array 103 represented by a hardware profile 249, a plurality of execution configurations 227, . . . , 228. The toolchain 201 can discard invalid configurations and report valid configurations 227, . . . , 228 to the exploration tool 205.
The exploration tool 205 is configured to store the execution configurations 227, . . . , 228 determined to be valid by the toolchain 201 into the database 203, and to request the simulator 271 to simulate the execution of the assembly language program 101 according to the execution configurations 227, . . . , 228.
For example, the exploration tool 205 can request the simulator 271 to simulate the execution of the assembly language program 101 in the coarse grained reconfigurable array 103, represented by the hardware profile 249, according to an execution configuration 227. From the result of the simulation, the exploration tool 205 can determine whether the assembly language program 101 running according to the execution configuration 227 accurately generates the data flow outputs as expected. Further, from the result of the simulation, the exploration tool 205 can determine the performance metrics 207 of the execution configuration 227 in implementing the assembly language program 101 in the coarse grained reconfigurable array 103, such as the speed of executing the assembly language program 101, the energy consumption of executing the assembly language program 101, etc. The exploration tool 205 can also determine the instruction packing density of the execution configuration 227, the tile utilization rate of the execution configuration 227, the data flow throughput of the assembly language program 101 implemented according to the execution configuration 227, etc.
Similarly, the exploration tool 205 can use the simulator 271 to simulate the execution of the assembly language program 101 according to another execution configuration 228 produced by the toolchain 201. From the simulation, the exploration tool 205 can identify the performance metrics 208 of the execution configuration 228.
In some embodiments, the exploration tool 205 can use multiple instances of the simulator 271 to run simulations of the assembly language program 101 according to different execution configurations (e.g., 227, 228) in parallel. Thus, the time period to complete the simulations of the execution configurations 227, . . . , 228 in the database 203 can be reduced.
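Running several simulations concurrently can be sketched as follows. The `simulate` function is a hypothetical stand-in that returns a made-up cycle count; a real simulator instance would more likely be a separate process or service, and the dictionary layout of a configuration is assumed for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(config):
    """Hypothetical stand-in for one simulator run on one configuration."""
    # Toy model: pretend each placed instruction costs ten cycles.
    return {"config_id": config["id"], "cycles": 10 * len(config["placements"])}


def simulate_all(configs, max_workers=4):
    """Run one simulation per configuration, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input configurations.
        return list(pool.map(simulate, configs))
```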
After generating the performance metrics 207, . . . , 208 of the execution configurations, determined to be valid in a design space for the assembly language program 101 by the toolchain 201, the exploration tool 205 can be used to select an execution configuration (e.g., 227) according to a set of configuration selection criteria 209.
For example, the user of the exploration tool 205 can specify a criterion based on the speed of the execution of the assembly language program 101 and ignore the other aspects of the performance metrics 207, . . . , 208 in comparing the execution configurations 227, . . . , 228. A fastest execution configuration (e.g., 227) can be identified and provided by the exploration tool 205 to the user.
Optionally, the exploration tool 205 can present, to the user, a range of a performance indicator (e.g., energy expenditure in executing the assembly language program 101 in the coarse grained reconfigurable array 103) of the execution configurations 227, . . . , 228 in the database 203. The user can specify a selection criterion based on the range. For example, the user can select a sub-range from the range as a selection criterion to exclude execution configurations having the performance indicator outside of the sub-range. The user can then use one or more selection criteria to narrow the selections and/or select a best performer.
Optionally, the exploration tool 205 can be configured to allow the user to specify a composite performance indicator that is a function of a plurality of performance indicators. The exploration tool 205 can be configured to identify and present a best performer (e.g., execution configuration 227) according to the composite performance indicator.
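The selection options described above (a single criterion such as speed, a sub-range filter, and a composite indicator) can be sketched as follows. The function `select_configuration`, the metric names `cycles` and `energy`, and the convention that a lower score is better are assumptions made for this illustration.

```python
def select_configuration(metrics, ranges=None, score=None):
    """Pick a configuration ID from recorded metrics.

    metrics: {config_id: {indicator_name: value}} (hypothetical layout).
    ranges:  {indicator_name: (low, high)}; configurations outside any
             sub-range are excluded.
    score:   function of one metrics dict; the lowest score wins. Defaults
             to the cycle count, i.e., the fastest configuration.
    """
    ranges = ranges or {}
    candidates = {
        cid: m for cid, m in metrics.items()
        if all(low <= m[name] <= high for name, (low, high) in ranges.items())
    }
    if not candidates:
        return None
    score = score or (lambda m: m["cycles"])
    return min(candidates, key=lambda cid: score(candidates[cid]))
```

For instance, a composite indicator such as `lambda m: m["cycles"] * m["energy"]` trades speed against energy in a single number.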
Thus, the exploration tool 205 allows a user to explore the design space of valid execution configurations 227, . . . , 228 for running the assembly language program 101 in a coarse grained reconfigurable array 103 to find a desirable execution configuration (e.g., 227).
Optionally, the exploration tool 205 can be configured to include the toolchain 201 and/or the simulator 271. Optionally, the exploration tool 205 is configured to run on a computing apparatus having one or more coarse grained reconfigurable arrays; and the exploration tool 205 can use the coarse grained reconfigurable arrays of the computing apparatus to run the assembly language program 101 according to execution configurations (e.g., 227, 228) to obtain the performance metrics (e.g., 207, 208).
In
For example, the performance metrics 207 can include the clock cycle count 251 used for the completion of the execution of the assembly language program 101 in the coarse grained reconfigurable array 103 according to the execution configuration 227.
For example, the performance metrics 207 can include the amount of energy consumption 253 used for the completion of the execution of the assembly language program 101 in the coarse grained reconfigurable array 103 according to the execution configuration 227.
For example, the performance metrics 207 can include the instruction flow latency 255 used for the completion of the execution of the assembly language program 101 in the coarse grained reconfigurable array 103 according to the execution configuration 227.
For example, the performance metrics 207 can include the density 257 of packing instructions on tiles (e.g., 141, 143, . . . , 145) according to the execution configuration 227 for the execution of the assembly language program 101 in the coarse grained reconfigurable array 103.
For example, the performance metrics 207 can include the number 259 of tiles being utilized according to the execution configuration 227 for the execution of the assembly language program 101 in the coarse grained reconfigurable array 103.
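Two of the metrics listed above, the number of tiles utilized and the instruction packing density, can be derived directly from the placements in a configuration. The sketch below assumes a hypothetical layout in which `placements` maps an instruction name to a (tile ID, slot ID) pair, and it defines density as placed instructions divided by the slots available on the utilized tiles; the text above does not fix a formula, so this definition is an assumption.

```python
def packing_metrics(placements, slots_per_tile):
    """Compute tile usage and packing density from instruction placements.

    placements: {instruction_name: (tile_id, slot_id)} (hypothetical layout).
    """
    if not placements:
        return {"tiles_used": 0, "packing_density": 0.0}
    tiles_used = len({tile for tile, _slot in placements.values()})
    # Assumed definition: fraction of slots filled on the tiles in use.
    density = len(placements) / (tiles_used * slots_per_tile)
    return {"tiles_used": tiles_used, "packing_density": density}
```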
Optionally, the exploration tool 205 can be configured to run on a computing apparatus having one or more coarse grained reconfigurable arrays (e.g., 103). The exploration tool 205 can configure a coarse grained reconfigurable array 103 to run an assembly language program 101 according to an execution configuration 227 to generate the performance metrics 207.
In
The execution configuration 227 can be loaded into a simulator 271 (e.g., by the exploration tool 205 of
The exploration tool 205 can compare the predicted attributes 273 and the measured attributes 275 to identify discrepancies that can be indicative of errors or defects in the programming of the simulator, the toolchain, or both. For example, the exploration tool 205 can be configured to present, to the user, error detection results 277 generated from comparing the predicted attributes 273 and the measured attributes 275.
Optionally, the simulator 271 can be replaced with one or more coarse grained reconfigurable arrays (e.g., 103) that can be configured by the exploration tool 205 to run the assembly language program 101 according to the execution configuration 227.
Optionally, the predicted attributes 273 can include a set of outputs generated by the assembly language program 101 based on a set of inputs (e.g., random inputs); and the measured attributes 275 include the corresponding set of outputs generated by the simulator 271 from the execution configuration 227 based on the same set of inputs (e.g., random inputs).
For example, the exploration tool 205 can be implemented in a computing apparatus (e.g., illustrated in
For example, the computing apparatus can be configured via an exploration tool 205, a toolchain 201, and a simulator 271 (or one or more coarse grained reconfigurable arrays (e.g., 103)) to perform the method of
At block 291, the method includes receiving, in the computing apparatus, a program 101 identifying data flows through memory locations.
For example, the program 101 can be an assembly language program 101 configured to identify the data flows through the memory locations represented by memory variables (e.g., 153, . . . , 163; 175, . . . , 185; and 153, . . . , 163, 175, . . . , 185, 191, . . . ) identifying instructions configured to transform data in the data flows.
For example, the computing apparatus can be configured with a toolchain 201 to generate the configurations 227, . . . , 228 from the assembly language program 101 and a hardware profile 249 of the device (e.g., the coarse grained reconfigurable array 103). The computing apparatus can be configured with an exploration tool 205 to provide the assembly language program 101 to the toolchain 201 to generate the plurality of configurations 227, . . . , 228. The exploration tool 205 can record the plurality of configurations 227, . . . , 228 in a database 203.
At block 293, the method includes identifying, by the computing apparatus, a plurality of configurations 227, . . . , 228 of executing the program 101 on a device having a plurality of circuit units configured to operate in parallel.
For example, the device can include a coarse grained reconfigurable array 103 having a plurality of tiles 141, 143, . . . , 145 configured as the plurality of circuit units to operate in parallel; and each of the tiles 141, 143, . . . , 145 can have a plurality of instruction slots (e.g., 351, 353, . . . , 355) for pipelined execution.
For example, the assembly language program 101 is configured to identify data flow operations through a dispatch interface 121, one or more memory interfaces 123, . . . , 125, and tiles 141, 143, . . . , 145 of coarse grained reconfigurable arrays (e.g., 103).
At block 295, the method includes determining, by the computing apparatus, performance metrics 207, . . . , 208 of the configurations 227, . . . , 228 in execution of the program 101.
For example, the performance metrics 207, 208 can include: an indicator of a speed of executing the assembly language program in the device according to the first configuration (e.g., clock cycle count 251), an indicator of an amount of energy consumed by the device in execution of the assembly language program according to the first configuration (e.g., amount of energy consumption 253), an indicator of tile utilization level of the device in execution of the assembly language program according to the first configuration (e.g., utilized tile number 259), instruction packing density 257, instruction flow latency 255, and/or other indicators.
For example, the computing apparatus can be configured with a simulator 271 to perform simulations of running the assembly language program 101 on the device, represented by the hardware profile 249, according to the execution configurations 227, . . . , 228. The exploration tool 205 can automate the simulations performed via the simulator 271 by providing the plurality of configurations 227, . . . , 228 to the simulator 271 of the device to determine the performance metrics 207, . . . , 208 of the configurations 227, . . . , 228. The exploration tool 205 can record the performance metrics 207, . . . , 208 of the execution configurations 227, . . . , 228 in the database 203.
At block 297, the method includes receiving, in the computing apparatus, a user request identifying one or more criteria 209.
At block 299, the method includes identifying, by the computing apparatus from the plurality of configurations 227, . . . , 228 in response to the user request, a first configuration 227 of executing the program on the device based on the one or more criteria 209 and the performance metrics 207, . . . , 208.
For example, the exploration tool 205 can apply the configuration selection criteria 209 to the performance metrics 207, . . . , 208 to select the first configuration 227 that is a best match to the criteria 209.
The exploration tool 205 can compare first attributes 273 of the configurations (e.g., 227) predicted by the toolchain 201 and second attributes 275 of the configurations (e.g., 227) measured using the simulator 271 to detect errors in the toolchain 201 and the simulator 271. For example, the comparison can include comparing validity of the first configuration 227 determined by the toolchain 201 and validity of the first configuration 227 determined by the simulator 271. For example, the comparison can include comparing an output of the assembly language program 101 determined by the toolchain 201 and an output of the assembly language program 101 determined by the simulator 271 in a simulation using the first configuration 227. For example, the comparison can include comparing a number of clock cycles determined by the toolchain 201 for an execution of the assembly language program 101 using the first configuration 227 and a number of clock cycles determined by the simulator 271 in the simulation using the first configuration 227.
A discrepancy in an attribute 273 predicted by the toolchain 201 and a corresponding attribute 275 measured via the simulator 271 can be an indication of an error, defect or bug in the toolchain 201, the simulator 271, or both.
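The comparison of predicted and measured attributes can be sketched as a simple dictionary difference. The function name `find_discrepancies` and the attribute keys (`valid`, `cycles`, `outputs`) are hypothetical and chosen to mirror the validity, output, and clock cycle comparisons described above.

```python
def find_discrepancies(predicted, measured):
    """Return attributes whose predicted and measured values differ.

    Both arguments map attribute names to values; an attribute missing from
    one side is reported with None for that side.
    """
    names = sorted(set(predicted) | set(measured))
    return {
        name: (predicted.get(name), measured.get(name))
        for name in names
        if predicted.get(name) != measured.get(name)
    }
```

An empty result indicates that the toolchain and the simulator agree on every compared attribute; any entry is a candidate for an error detection result.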
Optionally, the computing apparatus has one or more coarse grained reconfigurable arrays (e.g., 103); and the exploration tool 205 can use the one or more coarse grained reconfigurable arrays (e.g., 103) to execute the assembly language program 101 according to the configurations 227, . . . , 228 recorded in the database 203 to determine the performance metrics 207, . . . , 208 of the configurations 227, . . . , 228.
For example, when there is a discrepancy in an attribute 273 predicted by the toolchain 201 and a corresponding attribute 275 measured via the simulator 271, the attribute determined from running the assembly language program 101 in the coarse grained reconfigurable arrays (e.g., 103) can be used to determine whether the error is in the toolchain 201, the simulator 271, or both.
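One way to use a hardware measurement to attribute a discrepancy is sketched below. Treating the value obtained from the coarse grained reconfigurable array as the reference is an assumption of this sketch, and the function name `locate_error` and its return labels are hypothetical.

```python
def locate_error(predicted, simulated, hardware):
    """Attribute a toolchain/simulator discrepancy using a hardware value.

    predicted: attribute value from the toolchain.
    simulated: the same attribute measured via the simulator.
    hardware:  the same attribute measured on the array (assumed reference).
    """
    if predicted == simulated:
        return "consistent"
    if predicted == hardware:
        return "simulator"   # the toolchain matches hardware, the simulator does not
    if simulated == hardware:
        return "toolchain"   # the simulator matches hardware, the toolchain does not
    return "both"            # neither matches the hardware value
```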
Optionally, the simulator 271 can be omitted; and the performance metrics 207, . . . , 208 are measured using the coarse grained reconfigurable arrays (e.g., 103) of the computing apparatus.
Optionally, the exploration tool 205 is configured to test the toolchain 201 and/or the simulator 271; and the measuring and recording of the performance metrics 207, . . . , 208 can be skipped.
Thus, at least one embodiment provides an exploration tool 205 of a design space of configurations (e.g., 227, . . . , 228) to execute a data flow program (e.g., 101) using circuit tiles (e.g., 141, 143, . . . , 145) of a coarse grained reconfigurable array 103. The exploration tool 205 can identify different configurations 227, . . . , 228 for the program 101 and determine performance metrics 207, . . . , 208 of the configurations 227, . . . , 228. A user of the exploration tool 205 can provide one or more criteria 209 in a request to the tool 205; and in response, the tool 205 can identify, from the different configurations 227, 228 and based on the one or more criteria 209 applied to the performance metrics 207, . . . , 208, a first configuration 227 of executing the program 101 on the coarse grained reconfigurable array 103. For example, the exploration tool 205 can use a toolchain 201 to generate the configurations 227, . . . , 228 and use a simulator 271 to run simulations of executions of the program 101 according to the configurations 227, . . . , 228. The exploration tool 205 can compare attributes 273 and 275 determined by the toolchain 201 and the simulator 271 for consistency in detecting errors or defects in the toolchain 201 and the simulator 271.
The computer system of
In some embodiments, the machine can be connected (e.g., networked) to other machines in a local area network (LAN), an intranet, an extranet, and/or the internet. The machine can operate in the capacity of a server or a client machine in client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment.
For example, the machine can be configured as a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The example computer system illustrated in
The processing device 502 in
The computer system of
The data storage system 518 can include a machine-readable medium 524 (also known as a computer-readable medium) on which is stored one or more sets of instructions 526 or software embodying any one or more of the methodologies or functions described herein. The instructions 526 can also reside, completely or at least partially, within the main memory 504 and/or within the processing device 502 during execution thereof by the computer system, the main memory 504 and the processing device 502 also constituting machine-readable storage media.
In one embodiment, the instructions 526 include instructions to implement functionality corresponding to an exploration tool 205, such as described with reference to
The present disclosure includes methods and apparatuses which perform the methods described above, including data processing systems which perform these methods, and computer readable media containing instructions which when executed on data processing systems cause the systems to perform these methods.
A typical data processing system may include an inter-connect (e.g., bus and system core logic), which interconnects a microprocessor(s) and memory. The microprocessor is typically coupled to cache memory.
The inter-connect interconnects the microprocessor(s) and the memory together and also interconnects them to input/output (I/O) device(s) via I/O controller(s). I/O devices may include a display device and/or peripheral devices, such as mice, keyboards, modems, network interfaces, printers, scanners, video cameras and other devices known in the art. In one embodiment, when the data processing system is a server system, some of the I/O devices, such as printers, scanners, mice, and/or keyboards, are optional.
The inter-connect can include one or more buses connected to one another through various bridges, controllers and/or adapters. In one embodiment the I/O controllers include a universal serial bus (USB) adapter for controlling USB peripherals, and/or an IEEE-1394 bus adapter for controlling IEEE-1394 peripherals.
The memory may include one or more of: read only memory (ROM), volatile random access memory (RAM), and non-volatile memory, such as hard drive, flash memory, etc.
Volatile RAM is typically implemented as dynamic RAM (DRAM) which requires power continually in order to refresh or maintain the data in the memory. Non-volatile memory is typically a magnetic hard drive, a magnetic optical drive, an optical drive (e.g., a DVD RAM), or other type of memory system which maintains data even after power is removed from the system. The non-volatile memory may also be a random access memory.
The non-volatile memory can be a local device coupled directly to the rest of the components in the data processing system. A non-volatile memory that is remote from the system, such as a network storage device coupled to the data processing system through a network interface such as a modem or ethernet interface, can also be used.
In the present disclosure, some functions and operations are described as being performed by or caused by software code to simplify description. However, such expressions are also used to specify that the functions result from execution of the code/instructions by a processor, such as a microprocessor.
Alternatively, or in combination, the functions and operations as described here can be implemented using special purpose circuitry, with or without software instructions, such as using application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA). Embodiments can be implemented using hardwired circuitry without software instructions, or in combination with software instructions. Thus, the techniques are limited neither to any specific combination of hardware circuitry and software, nor to any particular source for the instructions executed by the data processing system.
While one embodiment can be implemented in fully functioning computers and computer systems, various embodiments are capable of being distributed as a computing product in a variety of forms and are capable of being applied regardless of the particular type of machine or computer-readable media used to actually effect the distribution.
At least some aspects disclosed can be embodied, at least in part, in software. That is, the techniques may be carried out in a computer system or other data processing system in response to its processor, such as a microprocessor, executing sequences of instructions contained in a memory, such as ROM, volatile RAM, non-volatile memory, cache or a remote storage device.
Routines executed to implement the embodiments may be implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions referred to as “computer programs.” The computer programs typically include one or more instructions set at various times in various memory and storage devices in a computer, and that, when read and executed by one or more processors in a computer, cause the computer to perform operations necessary to execute elements involving the various aspects.
A machine readable medium can be used to store software and data which when executed by a data processing system causes the system to perform various methods. The executable software and data may be stored in various places including for example ROM, volatile RAM, non-volatile memory and/or cache. Portions of this software and/or data may be stored in any one of these storage devices. Further, the data and instructions can be obtained from centralized servers or peer to peer networks. Different portions of the data and instructions can be obtained from different centralized servers and/or peer to peer networks at different times and in different communication sessions or in a same communication session. The data and instructions can be obtained in entirety prior to the execution of the applications. Alternatively, portions of the data and instructions can be obtained dynamically, just in time, when needed for execution. Thus, it is not required that the data and instructions be on a machine readable medium in entirety at a particular instance of time.
Examples of computer-readable media include but are not limited to non-transitory, recordable and non-recordable type media such as volatile and non-volatile memory devices, read only memory (ROM), random access memory (RAM), flash memory devices, floppy and other removable disks, magnetic disk storage media, optical storage media (e.g., compact disk read-only memory (CD ROM), digital versatile disks (DVDs), etc.), among others. The computer-readable media may store the instructions.
The instructions may also be embodied in digital and analog communication links for electrical, optical, acoustical or other forms of propagated signals, such as carrier waves, infrared signals, digital signals, etc. However, propagated signals, such as carrier waves, infrared signals, digital signals, etc. are not tangible machine readable media and are not configured to store instructions.
In general, a machine readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant, manufacturing tool, any device with a set of one or more processors, etc.).
In various embodiments, hardwired circuitry may be used in combination with software instructions to implement the techniques. Thus, the techniques are neither limited to any specific combination of hardware circuitry and software nor to any particular source for the instructions executed by the data processing system.
The above description and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding. However, in certain instances, well known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure are not necessarily references to the same embodiment; and, such references mean at least one.
In the foregoing specification, the disclosure has been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.
The present application claims priority to Prov. U.S. Pat. App. Ser. No. 63/496,936 filed Apr. 18, 2023, the entire disclosure of which application is hereby incorporated herein by reference. The present application relates to U.S. patent application Ser. No. 17/705,099, filed Mar. 25, 2022 and entitled “Programming a Coarse Grained Reconfigurable Array through Description of Data Flow Graphs”, U.S. patent application Ser. No. 17/705,112, filed Mar. 25, 2022 and entitled “Schedule Instructions of a Program of Data Flows for Execution in Tiles of a Coarse Grained Reconfigurable Array”, U.S. patent application Ser. No. 17/705,091, filed Mar. 25, 2022 and entitled “Configure a Coarse Grained Reconfigurable Array to Execute Instructions of a Program of Data Flows”, and Prov. U.S. Pat. App. Ser. No. 63/323,949, filed Mar. 25, 2022 and entitled “Mapping Workloads to Circuit Units in a Computing Device via Reinforcement Learning”, the disclosures of which applications are hereby incorporated herein by reference.
Number | Date | Country
---|---|---
63496936 | Apr 2023 | US