Customisable processors represent an emerging and effective paradigm for executing embedded applications under high-performance, short time-to-market, and low-power requirements. Among the possible customisation directions, a particularly interesting one is that of Instruction-Set Extensions (ISEs): Application-specific Functional Units (AFUs) can be added to the processor core in order to implement specialised instructions and speed up a particular application. As these processors become available (e.g., Tensilica Xtensa, ARC ARCtangent, STMicroelectronics ST200, and MIPS CorExtend), techniques are emerging for automatically selecting the best ISEs for an application, given the application source code and under various constraints.
An example of such a technique is described in the document US 2007/0162902.
Customisable embedded processors that are available on the market make it possible for designers to speed up execution of applications by using Application-specific Functional Units (AFUs), implementing Instruction-Set Extensions (ISEs). Furthermore, techniques for automatic ISE identification have been improving; many algorithms have been proposed for choosing, given the application's source code, the best ISEs under various constraints. Read and write ports between the AFUs and the processor register file are an expensive asset, fixed in the micro-architecture (some processors allow only two read ports and one write port); on the other hand, a large availability of inputs and outputs to and from the AFUs exposes high speedup. Here we present a solution to the limited number of register file ports: register file access is serialised, so that operands and results are read and written over multiple cycles. The solution is innovative for two reasons: (1) it exploits and brings forward the progress in ISE identification under constraint, and (2) it combines register file access serialisation with pipelining in order to obtain the best global solution. Our method consists of scheduling graphs, corresponding to ISEs, under input/output constraints.
In the present application, the optimization of a microprocessor is achieved with a microprocessor having at least one Application-specific Functional Unit (AFU), said AFU implementing a part of the functionality of an Instruction-Set Extension (ISE), said ISE corresponding to a data flow graph having a plurality of inputs and outputs, said microprocessor having micro-architectural constraints including, but not restricted to, the number of register file read ports, the number of register file write ports, and the cycle time, said AFU comprising a set of storage elements and at least one new architectural microprocessor op-code for each ISE.
The invention will be better understood thanks to the attached drawings.
A particularly expensive asset of the processor core is the number of ports to the register file that the AFUs are allowed to use. While this number is typically kept small in available processors—indeed some only allow two read ports and one write port—it is also true that input/output allowance impacts directly on speedup. A typical trend can be seen in
As a motivational example, consider
We present a method that identifies ISE candidates that exceed the constraint and then maps them onto the available I/O by serialising register port access.
Presented is a method for identifying an ISE that recognises the possibility of serialising operand-reading and result-writing of AFUs that exceed the processor I/O constraints. Also presented is a method for input/output constrained scheduling that minimises the resulting latency of the chosen AFUs, and the number of storage elements for the given latency, by combining pipelining with multi-cycle register file access. Measurements of the obtained speedup show that the proposed method finds high-performance schedules resulting in tangible improvement when compared to the single-cycle register file access case.
Related Work
Discussion of the state of the art is here divided in two parts: the first relates to scheduling and pipelining, while the second details works on automatic Instruction-Set Extension.
A well-known unconstrained scheduling algorithm for minimum latency is ASAP, while many algorithms for scheduling under constraint have been presented, for example resource-constrained and time-constrained scheduling. Resource-constrained scheduling limits the number of computational resources that can be used in a cycle; it is an intractable problem, and list scheduling is a heuristic used for solving it. Proposed solutions to time-constrained scheduling, where relative timing constraints between operations are specified, include Force Directed Scheduling and integer linear programming. This paper defines and solves another type of constrained scheduling, called here I/O constrained scheduling, which finds the minimum latency schedule for a DAG under the constraint that no more than Nin inputs can be read and no more than Nout outputs can be written in any given cycle. It can be seen as a special case of resource-constrained scheduling. Retiming algorithms, where registers are moved in a circuit in order to optimise performance or area, are also related to this work. In particular, a reported algorithm for retiming DAGs is similar to a step of the I/O constrained scheduling algorithm presented here.
The problem of identifying instruction-set extensions consists in detecting clusters of operations which, when implemented as a single complex instruction, maximise some metric—typically performance. Such clusters must invariably satisfy some constraint; for instance, they must produce a single result or use no more than four input values. The problem solved by the algorithms presented in this paper is formalised in Section III, but this generic formulation is used here to discuss related work.
Some methods have been proposed where authors essentially concentrate on targeting maximal reuse of complex instructions. In this case, sequences or simple clusters of operations often appear as the best candidates. The importance of growing larger clusters for high speedup is acknowledged in some recent works. Another recent formulation, experimented on the Nios II processor, uses an exponential enumeration algorithm to find all patterns with a single output; the algorithm is usable in practice in the given micro-architectural context by limiting the number of inputs.
Work on Application Specific Instruction-set Processors (ASIPs) generation is also related to ISE identification, but it differs from the latter because it involves generation of complete instruction sets for specific applications.
The present work combines any ISE identification algorithm that works under constraint with AFU pipelining and I/O constrained scheduling. It recognises the possibility of serialising access to the register file and identifies AFUs with larger I/O constraint than the allowed microarchitectural one; then, it automatically maps them to the actual read/write port availability. To the best of our knowledge, this is the first work that proposes a solution to exploit this possibility in an automatic way.
ISE Selection
Our method is similar in nature to the single-cut identification problem addressed in prior work: we want to find a convex sub-graph S of the Data Flow Graph (DFG) of a basic block. The sub-graph S, which we call a cut, represents the functionality to be implemented in a specialised functional unit. The cut S is chosen to maximise some merit function M(S), which represents the speedup achieved when the cut is implemented as a custom instruction, while the input and output nodes of S are such as to allow implementation with a limited number of register-file ports, that is, IN(S) ≤ Nin and OUT(S) ≤ Nout, where the constants Nin and Nout depend on the micro-architecture. Finally, S must be a convex graph to guarantee schedulability in typical compilers.
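The feasibility conditions on a cut can be illustrated with a short Python sketch. It assumes the DFG is given as a list of (producer, consumer) edges in which external operands and external consumers also appear as nodes; the function names `cut_io`, `is_convex` and `is_feasible` are hypothetical and not taken from the cited work, and values that are live outside the basic block are not modelled.

```python
def cut_io(edges, cut):
    """Return (IN(S), OUT(S)) for a cut given as a set of node names."""
    # Distinct values produced outside the cut and consumed inside it.
    inputs = {u for u, v in edges if u not in cut and v in cut}
    # Distinct nodes of the cut whose value is consumed outside it.
    outputs = {u for u, v in edges if u in cut and v not in cut}
    return len(inputs), len(outputs)


def _reachable(adj, start):
    """Nodes reachable from any node in `start` by following `adj`."""
    seen, stack = set(), list(start)
    while stack:
        n = stack.pop()
        for m in adj.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen


def is_convex(edges, cut):
    """True if no path leaves the cut and re-enters it."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
    below = _reachable(succ, cut) - cut   # outside nodes fed by the cut
    above = _reachable(pred, cut) - cut   # outside nodes feeding the cut
    return not (below & above)


def is_feasible(edges, cut, n_in, n_out):
    """IN(S) <= Nin, OUT(S) <= Nout, and S convex."""
    ins, outs = cut_io(edges, cut)
    return ins <= n_in and outs <= n_out and is_convex(edges, cut)


# Hypothetical example DFG: a, b, c are operands, "out" is a consumer.
EDGES = [("a", "mul"), ("b", "mul"), ("mul", "add"), ("c", "add"),
         ("mul", "shift"), ("shift", "sub"), ("add", "sub"), ("sub", "out")]
```

In this example the cut {mul, add, sub} is not convex, because the path mul, shift, sub leaves the cut and re-enters it; adding shift to the cut makes it convex with three inputs and one output.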
However, our method differs from the above problem (disclosed in US2007/0162902) for the following two reasons: (a) the cut S is allowed to have more inputs than the read ports of the register file and/or more outputs than the write ports; (b) if this happens, successive transfers of operands and results to and from the specialised functional unit are accounted for in the latency of the special instruction. Our method considers (b) while at the same time introducing pipeline registers, if needed, in the data-path of the unit.
The way we solve the new single-cut identification problem consists of three steps: (1) Best cuts for an application are generated, using any ISE identification algorithm (e.g., the single-cut identification described in US2007/0162902), for all possible combinations of input and output counts equal to or above Nin and Nout, and below a reasonable upper bound, e.g., 10/5. (2) Both the registers required to pipeline the functional unit under a fixed timing constraint (the cycle time of the host processor) and the registers required to temporarily store excess operands and results are added to the DFG of S. In other words, the actual number of inputs and outputs of S is made to fit the micro-architectural constraints. (3) The best cuts are selected among all cuts. Step (2) is the actual problem that is formalised and solved using the method described here.
Problem Statement
We call S(V, E) the DAG representing the dataflow of a potential special instruction to be implemented in hardware; the nodes V represent primitive operations and the edges E represent data dependencies. Each graph S is associated with a graph
S+(V∪I∪O∪{vin, vout}, E∪E+)
which contains additional nodes I, O, vin, and vout, and edges E+. The additional nodes I and O represent, respectively, input and output variables of the cut. The node vin is called source and has edges to all nodes in I. Similarly, the node vout is the sink and all nodes in O have an edge to it. The additional edges E+ connect the source to the nodes I, the nodes I to V, V to O, and O to the sink.
Each node u ∈ V has an associated positive real weight, λ(u); it represents the latency of the component implementing the corresponding operator. Nodes vin, vout, I, and O have a null weight. Each edge (u,v) ∈ E has an associated non-negative integer weight, ρ(u,v); it represents the number of registers in series present between the adjacent operators. A null weight on an edge indicates a direct connection (i.e., a wire). Initially all edge weights are null (that is, the cut S is a purely combinatorial circuit).
Our goal is to modify the weights of the edges of S+ in such a way as to have (1) the critical path (maximal latency between inputs and registers, registers and registers, and registers and outputs) below or equal to some desired value Λ, (2) the number of inputs (outputs) to be provided (received) at each cycle below or equal to Nin (Nout), and (3) a minimal number of pipeline stages, R. To express this formally, we introduce the sets W_i^IN, which contain all edges (vin,u) whose weight ρ(vin,u) is equal to i. Similarly, the sets W_i^OUT contain all edges (u,vout) whose weight ρ(u,vout) is equal to i. We write |W_i^IN| to indicate the number of elements in the set W_i^IN. The problem we want to solve is the particular case of scheduling described below.
Problem 1: Minimise R under the following constraints:
1) Pipelining. For all combinatorial paths between u ∈ S+ and v ∈ S+, that is, for all those paths such that Σ_{(s,t) on the path} ρ(s,t) = 0:
Σ_{k on the path} λ(k) ≤ Λ (1)
2) Legality. For all paths between vin and vout:
Σ_{(s,t) on the path} ρ(s,t) = R − 1 (2)
3) I/O schedulability. ∀i ≥ 0:
|W_i^IN| ≤ Nin and |W_i^OUT| ≤ Nout (3)
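The I/O schedulability constraint lends itself to a direct check. The sketch below is a minimal illustration in Python, assuming the edge weights ρ are held in a dictionary keyed by (tail, head) pairs and that the source and sink are named "vin" and "vout"; the function name `io_schedulable` and this data layout are assumptions made for the example.

```python
from collections import Counter


def io_schedulable(rho, n_in, n_out, source="vin", sink="vout"):
    """Check that no cycle needs more than n_in reads or n_out writes.

    rho maps each edge (u, v) to its register count; the weight of an
    edge leaving the source (entering the sink) is the cycle in which
    that operand is read (that result is written).
    """
    reads = Counter(w for (u, v), w in rho.items() if u == source)
    writes = Counter(w for (u, v), w in rho.items() if v == sink)
    return (all(count <= n_in for count in reads.values())
            and all(count <= n_out for count in writes.values()))
```

For instance, three operands read in cycles 0, 0 and 1 fit two read ports, whereas the same operands all read in cycle 0 do not.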
The first bullet ensures that the circuit can operate at the given cycle time Λ. The second ensures a legal schedule, that is, a schedule which guarantees that the operands of any given instruction arrive together. The third bullet defines a schedule of communication to and from the functional unit that never exceeds the available register ports: for each edge (vin,u), registers ρ(vin,u) do not represent physical registers, but the schedule used by the processor decoder to access the register file. Similarly, for each (u, vout), ρ(u, vout) indicates when results are to be written back. For this reason, registers on input edges (vin, u) and on output edges (u, vout) will be called pseudo-registers from now on; in all figures, they are shown with a lighter colour than physical registers. As an example,
Method
The method proposed for solving Problem 1 first generates all possible pseudo-register configurations at the inputs, meaning that pseudo-registers are added on input edges (vin,u) in all ways that satisfy the input schedulability constraint, i.e., |W_i^IN| ≤ Nin. This is obtained by repeatedly applying the n choose r problem (the r-combinations of an n-set), with r = Nin and n = |I|, to the set of input nodes I of S+, until all input variables have been assigned a read slot, i.e., until all input edges (vin,u) have been assigned a weight ρ(vin,u). Considering only the r-combinations ensures that no more than Nin input values are read at the same time. The number of n choose r combinations is n! / (r! (n − r)!).
By repeatedly applying n choose r until all inputs have been assigned, the number of total configurations becomes n! / ((r!)^x (n − x·r)!), where x = ⌊n/r⌋.
Note that the complexity of this step is exponential in the number of inputs of the graph, which is a very limited quantity in practical cases (e.g., in the order of tens).
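The enumeration of input configurations can be sketched in Python as follows. This is an illustration under assumptions, not the original implementation: the names `input_configurations` and `configuration_count` are hypothetical, each read slot is filled with exactly Nin inputs until fewer than Nin remain, and the closed-form count is the product of the successive n choose r terms.

```python
from itertools import combinations
from math import factorial


def input_configurations(inputs, n_in, slot=0):
    """Yield dicts mapping each input to its read slot, i.e. rho(vin, u)."""
    inputs = list(inputs)
    if len(inputs) <= n_in:
        # Everything that is left fits into the current slot.
        yield {u: slot for u in inputs}
        return
    for chosen in combinations(inputs, n_in):
        rest = [u for u in inputs if u not in chosen]
        for tail in input_configurations(rest, n_in, slot + 1):
            config = {u: slot for u in chosen}
            config.update(tail)
            yield config


def configuration_count(n, r):
    """Product of C(n, r) * C(n - r, r) * ... in closed form."""
    x = n // r
    return factorial(n) // (factorial(r) ** x * factorial(n - x * r))
```

With four inputs and two read ports, the generator yields six configurations, each reading two operands in slot 0 and the other two in slot 1.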
Then, for every input configuration, the algorithm proceeds in 3 steps:
(1) A scheduling pass, described in the pseudocode below, is applied to the graph, visiting nodes in topological order. The algorithm essentially computes an ASAP schedule, but it differs from a general ASAP version because it considers an initial pseudo-register configuration. It is an adaptation of a retiming algorithm for DAGs and its complexity is O(|V|+|E|).
(2) The schedule is now legal at the inputs but not necessarily at the outputs, and some registers might have to be added. The schedule is legal at the output only if at most Nout edges to output nodes have 0 registers (i.e., a weight equal to zero), at most Nout edges to output nodes have a weight equal to 1, and so on. If this is not the case, a line of registers on all output edges is added until the previously mentioned condition is satisfied.
(3) Registers at the outputs are transformed into pseudo-registers (i.e., they are moved to the right of output nodes, on edges (u, vout)), as shown in
All schedules of minimum latency are the ones that solve Problem 1. Among them, a schedule requiring a minimum number of registers is then chosen.
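The effect of steps (2) and (3) on the outputs can also be viewed as assigning a write slot to every result. The Python sketch below takes that view; it is a swapped-in greedy slot assignment rather than the register-line insertion described above, and it assumes that each output u becomes available in a known cycle `ready[u]`. The names `assign_write_slots` and `latency` are hypothetical.

```python
def assign_write_slots(ready, n_out):
    """Give each output the earliest cycle with a free write port.

    ready maps an output to the first cycle in which its value exists;
    the returned dict plays the role of rho(u, vout).
    """
    used = {}   # cycle -> number of results already written in it
    slot = {}
    for out in sorted(ready, key=ready.get):
        cycle = ready[out]
        while used.get(cycle, 0) >= n_out:
            cycle += 1          # all write ports busy: wait one cycle
        used[cycle] = used.get(cycle, 0) + 1
        slot[out] = cycle
    return slot


def latency(slot):
    """Number of cycles until the last result has been written."""
    return max(slot.values()) + 1 if slot else 0
```

With three results that are all ready in cycle 1 and a single write port, the writes are spread over cycles 1, 2 and 3; with two write ports, the last write happens in cycle 2.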
Example of pseudocode of the ASAP algorithm. For every node u, path_delay(u) indicates the maximum delay among paths to the node that have no registers, and delay(u) indicates its individual delay, λ. For every edge e, path_weight(e) indicates the maximum number of registers from the source node vin to the edge, and weight(e) indicates the number of registers on the edge itself, ρ.
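A minimal Python sketch of such a pass is given below. It is an illustration under stated assumptions rather than the original pseudocode: the graph is a dictionary of edge weights keyed by (tail, head), nodes missing from `delay` are taken to have null latency, no single operator is slower than the cycle time, and the names `asap_schedule` and `stage` are hypothetical. It relies on `graphlib`, available in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter


def asap_schedule(delay, weight, cycle_time):
    """ASAP pass starting from a given pseudo-register configuration.

    delay:  node -> latency (lambda); missing nodes count as 0.
    weight: (u, v) -> registers on the edge (rho); input edges
            (vin, u) already carry their pseudo-registers.
    Returns the updated weights and the stage of every node.
    """
    weight = dict(weight)                      # leave the caller's dict intact
    preds = {}
    for u, v in weight:
        preds.setdefault(v, []).append(u)
        preds.setdefault(u, [])
    stage, path_delay = {}, {}
    for v in TopologicalSorter(preds).static_order():
        incoming = preds[v]
        # path_weight: registers from the source up to the end of each edge.
        path_weight = {u: stage[u] + weight[(u, v)] for u in incoming}
        stage[v] = max(path_weight.values(), default=0)
        for u in incoming:
            # Equalise so that all operands of v arrive in the same cycle.
            weight[(u, v)] += stage[v] - path_weight[u]
        comb = max((path_delay[u] for u in incoming
                    if weight[(u, v)] == 0), default=0.0)
        if comb + delay.get(v, 0.0) > cycle_time:
            # Cycle time exceeded: cut all incoming edges with a register.
            for u in incoming:
                weight[(u, v)] += 1
            stage[v] += 1
            comb = 0.0
        path_delay[v] = comb + delay.get(v, 0.0)
    return weight, stage
```

The stage of the sink plus one gives the number of cycles of the resulting schedule; the caller can compare this value across input configurations.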
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB2007/053866 | 9/24/2007 | WO | 00 | 4/5/2010 |
Number | Date | Country |
---|---|---|
60846353 | Sep 2006 | US |