Analyzing diagnostic data generated by multiple threads within an instruction stream

DESCRIPTION OF THE DRAWINGS

The present invention will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which.

FIG. 1 schematically shows an embodiment of the present invention;

FIG. 2 illustrates a data processing apparatus and a diagnostic apparatus according to an embodiment of the present invention;

FIG. 3 schematically shows a system according to an embodiment of the present invention;

FIGS. 4
a and 4b are Hasse diagrams illustrating dependencies between events;

FIGS. 5
a and 5b show a Hasse diagram and corresponding data structure;

FIGS. 6
a and 6b illustrate simpler dependencies in Hasse diagram and data structure;

FIG. 6
c shows a disadvantage of the simplified data structure;

FIG. 7
a to d schematically shows the splitting into separately executable sections of a computer program according to an embodiment of the present invention;

FIG. 8
a to b schematically shows a method of splitting and then merging sections of a computer program;

FIG. 9 schematically shows data communication between two sections of a program;

FIG. 10
a shows a simple computer program annotated according to an embodiment of the present invention;

FIG. 10
b shows the maximal set of threads for the program of FIG. 4a.

FIG. 11 schematically illustrates an asymmetric multiprocessing apparatus with an asymmetric memory hierarchy;

FIG. 12 illustrates an architectural description;

FIG. 13 illustrates a communication requirement; and

FIG. 14 illustrates communication support.

DESCRIPTION OF EMBODIMENTS

FIG. 1 shows very schematically an embodiment of the present invention. This embodiment comprises a system 10 which is a multi-threaded program running on a multiprocessor system. When this system is being traced, a stream of trace data 12 which consists of a complex sequence of events of all the different threads and processes in the system is produced. This trace data is very hard to analyse and thus, the invention uses trace reordering logic 20 to reorder this trace into an order which more closely resembles the order of the original program and as such is easier to understand by a programmer. This reorder data is then input to trace user interface 30.

The trace reordering logic can reorder the trace data in a number of ways. In the embodiments shown, the complex sequence of events that are output from the system being traced comprise synchronisation events which are indicated as capitals in the stream of data 12. These synchronisation events are events where one processor is synchronised with another processor and thus, different threads being processed on different processors are synchronised at this point. The trace reordering logic can utilise this information to reorder the trace to produce an alternative, simpler sequence of events which could legally have been produced by the program. This data is then input to the trace user interface 30 where it may be displayed to the user or compressed to a disc for later replay.

Although, this figure shows a single event stream 12 in practice there may be multiple event streams, for example there may be one for each processor. These may then be merged prior to the trace reordering process or the merging could be performed as part of the reordering process.

Events that are traced can be quite low level events such as individual memory accesses and instruction execution or they can be quite high level events such as the start and stop of remote procedure calls. Data transfer between different processors, inserting/removing an entry from an inter-thread communication channel etc. Events may be generated automatically by hardware or they may be generated in response to operations in the program. For example, a communication library may explicitly generate an event on every send or receive or an operating system may explicitly generate an event on every context switch.

Embodiments of the present invention may also comprise a compiler 40 into which the original program is input to be compiled to run on the multiprocessor system. The compiler parallelises the program to run on the different processors and produces program fragments. It should be noted, that in some embodiments the compiler provides multiple threads from the program which would be suitable to run on a multiprocessor system but may be run on a single processor system. In this embodiment, compiler 40 in addition to producing the parallelised threads or program fragments also produces debug information 42 and dependency information 44. The dependency information indicates additional conditions under which it is legal to reorder events. This information is input to the trace reordering logic 20. By providing more information about the dependencies present in the original program, the trace reordering logic is able to provide further reordering and potentially produce a simpler trace.

The debug information 42 which may also comprise dependency data can be input to the user interface where it can be used when displaying the reordered diagnostic data.

The compiler may also insert trace generating operations into the program to reduce the number of possible re-orderings available. This may allow the trace reordering to run faster or use less memory and it may help to make the reconstructive trace more similar to a trace of the original program.

FIG. 2 shows another embodiment of the present invention which is similar to FIG. 1 but illustrated in a different manner. In this embodiment data processing apparatus 50 comprise a data store 52 in which the source program is stored a compiler 40 and multiple processing unit 60, 62, 64 and 66. The compiler 40 compiles the source program from data store 52 into different threads of execution and these are sent to the multiple processors 60, 62, 64, 66. These each have a trace unit and output trace information to output 54. Output 54 also serves to merge these input streams and outputs a single trace data stream 12 to diagnostic apparatus 70. Output 54 also receives dependency information from the compiler 40 and this is outputs alongside the merged trace data stream as metadata. Diagnostic apparatus 70 comprises a data store or buffer 72 for storing the trace data, reorder logic 20 and a display 30.

FIG. 3 shows an alternative depiction of an embodiment of the present invention. This embodiment shows the input data which is the source program 80 and a description of the system 82. The description of the system 82 comprises a description of the hardware processing unit 90 that processes the program. There is then a compiler 40 which is implemented in software in this embodiment and which compiles the program into program fragments for processing by the various processing units 60, 62 within hardware 90. Each processing unit has its own trace unit 100, 102 which generates the trace data. This trace data is output as a trace data stream 12 to reordering logic 20. This reordering logic may be either hardware or software. This reordering logic also receives programs specific dependency data from the compiler and generic dependency data relating to the system. It may also receive reordering criteria from the user. It uses this data to reorder the trace data to produce a reordered trace which is then sent to user interface 30.

The reordering logic uses different reordering rules to reorder the trace data depending on the embodiment. In some systems, different processors and different threads can run at different rates from each other and the relative rates may be unknown. Thus, any correct program must contain synchronisation between threads and processors whenever the threads communicate. These are synchronisation events. In such systems, we may assume that events in different threads are not ordered with respect to each other unless the two threads directly or indirectly synchronise with each other. This assumption allows a significant amount of trace reordering to be performed. In order to do this, these events are identified as synchronisation events.

Some of the techniques/rules used for reordering depend on the goals of the reordering. For example, if the goal is to gain a high level understanding of the program then the number of context switches, that is switches between different threads may be reduced. Furthermore, the maximum number of simultaneous live threads at any point in the trace should be reduced. Furthermore, the number of events between a causing event and a consequence of that event should also be reduced to make the trace easier to understand. The amount of reordering should be kept to only reordering events where it simplifies the trace.

Although, in some embodiments the trace may be complete for the whole program, in other embodiments it is only a part of the program that is traced.

There are a number of ways that trace reconstruction can be performed. The following describes a simple online algorithm that can be used for reordering trace.

In common with many online algorithms, the event stream can be split into three regions:

- 1. Processed events—these have been reordered by the algorithm.
- 2. The reconstruction “window”—events which are in the process of being reordered. The window may be of fixed or variable size.
- 3. Unobserved events—events that have not yet been considered for reordering.

Events within the window are represented by an appropriate data structure and trace reconstruction consists of repeatedly either adding the next unobserved event to the window or removing an event from the window and adding it to the processed events.

The reordering is restricted by dependencies between events: two events cannot be reordered relative to each other if one must come before another. For example, we write ‘x+y’ to indicate that event ‘a’ must occur before ‘b’ in the reordered trace. This relationship is transitive: if ‘x→y’ and ‘y→z’ then it must be true that ‘x→z’. The usual convention of using a Hasse diagram to represent the → relation is used. For example, the relation ‘a→b’, ‘b→c’, ‘a→c’, ‘b→d’, ‘c→d’, ‘a→d’ is represented by FIG. 4a.

It is assumed that the initial event stream is ordered such that if ‘x→y’ then event ‘x’ comes before event ‘y’.

To add an event ‘x’ to the window, we identify all events ‘e’ already within the window such that ‘e→x’ and add a link from each such event to x. As an transformation, an edge can be omitted or an existing edge deleted if there is an edge from event ‘e’ to some other edge ‘f’ and there is an edge from ‘f’ to ‘x’.

For example, if we add an event ‘a5’ to FIG. 4a and ‘a4→a5’ and ‘c2→a5’, then the window is updated as is depicted in FIG. 4b.

An event ‘y’ can only be removed from the window (and added to the end of the processed event list) if there is no other event ‘x’ within the window such that ‘x→y’. There are often multiple events that can be removed from the window. For example, in the above example, events ‘a1’, ‘b1’ and ‘c1’ can be removed from the window. When there are multiple events that can be removed from the window, heuristics are used to choose the best event to remove.

A number of different data structures may be used to represent the events within the window. An obvious representation is to reflect the ‘→’ relation directly using a directed graph structure. For example, the events and dependencies between them of FIG. 5a can be represented by the data structure of FIG. 5b Instead of representing the → relation exactly, it is also possible to use an approximation. For example, by restricting the complexity of dependencies allowed within the window, it is possible to use FIFO queues. For example, window of FIG. 6A can be represented by the data structure of FIG. 6B.

The rule here is that an event ‘x’ comes before an event ‘y’ in a queue if and only if ‘x→y’. This has the effect of restricting the window to a set of totally ordered events (e.g., all the events generated by a single thread or a single processor). The disadvantage of this simplified representation is that it is not capable of representing some types of dependency. For example, if the event ‘b2’ is added and ‘a2→b2’, then ‘a1’ and a2’ must be removed from the window, see FIG. 6C.

When an event is removed from the window (and added to the list of processed events), there is often a choice of several events that could be removed. In such cases, some simple rules for choosing which event to remove, these are:

- 1. If a removable event ‘e’ is on the same processor as the last event removed, then remove event ‘e’. (This reduces the amount of context switching for the user.)
- 2. If a removable event ‘e’ is on the same thread as the last event removed, then remove event ‘e’. (This reduces the amount of context switching for the user.)
- 3. If a removable event ‘e’ depends on the last event removed (that is, the last event removed was ‘f’ and ‘f→e’), then remove event ‘e’. (This makes causal relationships more obvious).
- 4. Is next in program order
- 5. If a removable event has many events that depend on it, then remove event ‘e’. (This may encourage large clusters of related events to be removed.)
- 6. If a removable event has few events that depend on it, then remove event ‘e’. (This contradicts the previous heuristic and helps reduce the total number of removable events at any one time.)
- 7. Remove the oldest event. (This helps reduce the amount of reordering.)
- 8. Assign each removable event a score based on the degree to which it meets any of the above criteria and choose the event with the highest score.

Some examples of events and dependencies for detailed trace are given below. Some processors support generation of detailed traces consisting of memory accesses and instructions. For such processors, appropriate dependencies might be:

- Instructions executed on the same thread are totally ordered: if instruction i1 is executed before instruction i2, then i1→i2. (On a superscalar processor, we mean that instruction i1 occurs before i2 in program order or is retired before i2.)
- If one thread waits for an event to happen and another thread signals that event, then there is a dependency from the signal instructions to the wait instructions.
- If an instruction i2 in one thread reads from a memory location previously written to by an instruction i1 in another thread, there is a dependency from the instruction that wrote to the memory location. In many cases, these dependencies should only be considered if the write is guaranteed to be observable—for example, if a memory barrier operation or a thread-synchronization operation has been used. Where the write is not guaranteed to be observable, it may be appropriate to report a possible error.

Some examples of events and dependencies for coarse-grained trace are given below.

Some systems may generate much coarser-grained traces consisting of when a given task (e.g., computing a result) starts and stops and when threads/processors communicate/synchronize with each other. For such systems, appropriate dependencies might be:

- Tasks executed on the same thread are totally ordered: if task t1 is executed before task t2, then t1→t2.

If completion of one task ‘t1’ triggers another task ‘t2’ to start, then ‘t1→t2’.

Communication between threads can be ‘direct’ or ‘buffered’. In direct communication, if a thread ‘A’ sends a value to a thread ‘B’, the next value read by ‘B’ will be the last value written by ‘A’. In such communication, there is a dependency from the event associated with sending a value to the next event associated with receiving a value. In buffered communication, if a thread ‘A’ sends a value to a thread ‘B’, the next value read by ‘B’ will be one of the values previously written sent by ‘A’. For example, when using a FIFO channel between threads, values are received by B in the order they are sent by A. In such communication, there is a dependency from the event associated with sending a value to the event associated with receiving that particular value from the channel.

There is often an overhead associated with generating events so it is desirable to reduce the amount of trace generated. This can be done by using coarser-grained trace but it can also be done by generating less detail in the trace and/or by exploiting the semantics of the operations generating the trace. For example, to model a buffered communication precisely, one must include enough information in an event so that the event caused by reading a particular value can be matched with the event that wrote that particular value. The amount of trace can be reduced by omitting this information in which case, all receive events are considered to be dependent on all preceding send events. Alternatively, if the communication is through a FIFO channel, the amount of trace can be reduced by exploiting the fact that values are received in the same order as they are written so the first send to that channel can be paired with the first receive from the channel, etc.

If trace is generated by executing instructions which emit events into the trace, it can be convenient to emit trace in communication and scheduling libraries. For example, the scheduler can emit an event on every context switch and communication libraries can emit events before sending an event, after sending an event, before receiving an event and after receiving an event.

It may also be convenient to generate trace in parallelising compilers. If a compiler automatically parallelizes a program, the compiler can insert instructions to emit trace events to allow more complete reconstruction of the trace. If the program is split into multiple threads and the threads frequently communicate with each other, it is often sufficient to emit events when the threads communicate which can be done using a communication library that emits events as previously described. If the threads communicate infrequently, it may be desirable for the compiler to emit additional events. For example, if the original program is:

for(int i=0; i<100; ++i) {

f( );

g( );

}

and the compiler splits this program into two threads that do not communicate with each other:

for(int i=0; i<100; ++i) {

f( );

}

and

for(int i=0; i<100; ++i) {

g( );

}

then there may not be enough information to reconstruct a properly synchronized trace showing alternating calls to functions ‘f’ and ‘g’. In this case, the compiler can insert instructions to emit an event after invoking ‘f’ and before invoking ‘g’ resulting in threads:

for(int i=0; i<100; ++i) {

f( );

emit(F);

}

and

for(int i=0; i<100; ++i) {

emit(G);

g( );

}

It should be noted that although the above described techniques are particularly useful for multiprocessor SoCs, they can also be applied to traces of VLIW execution (parallelism between functional units) and to traces of remote execution (parallelism between computers).

In summary embodiments of the invention that are able to modify the parallelizing compiler, may insert enough trace generation hints so that, instead of an arbitrary total order, a total order corresponding to the original sequential program is generated.

If an embodiment is not able to insert all the trace generation hints needed to achieve a total order, our reconstruction algorithm uses heuristics to reduce the number of context switches in the trace.

To achieve complete reconstruction, it helps if the parallelizing compiler inserts hints in the code that make it easier to match up corresponding parts of the program. In the absence of explicit hints, it may be possible to obtain full reconstruction using debug information to match parts of the program.

When there are no explicit hints or debug information, partial reconstruction can be achieved by using points in the program that synchronize with each other to guide the matching process. The resulting trace will not be sequential but will be easier to understand. A useful application is to make it simpler to understand a trace of a program written using an event-based programming style (e.g., a GUI, interrupt handlers, device drivers, etc.)

Partial reconstruction could also be used to simplify parallel programs running on systems that use release consistency. Such programs must use explicit memory barriers at all synchronization points so it will be possible to simplify traces to reduce the degree of parallelism the programmer must consider.

One simple case of this is reconstructing a ‘message passing’ view of bus traffic.

In summary, reordering of trace can make use of multiple sources of information including (in approximately increasing order of specificity):

1) Information about the compiler which is independent of the compilation of the program such as:

a) The style of parallelization used

b) Naming conventions used for variables/functions introduced during parallelization.

c) How programs are instrumented to produce trace (e.g., a trace event might be generated at the end of every loop).

2) Information about how a particular program was compiled such as:

a) Which sections of code were parallelized.

b) Where threads and communication/synchronization between threads was introduced.

c) How sections of code in the original program relate to sections of code in the parallelized program, e.g., line 23 of the original program might correspond to line 45 in the parallelized program.

d) How variables in the original program relate to variables in the parallelized program, e.g. a variable ‘x’ in the original program might have been split into two parts ‘x1’ and ‘x2’ in the parallelized program

e) What instrumentation has been introduced into this program e.g., an event might be generated indicating how many times a particular loop executed.

3) Information about this particular execution. This primarily consists of the trace data but might also include information about which processors executed particular threads (say).

4) User preferences.

Details of further techniques are given below.

FIG. 7
a shows a portion of a computer program comprising a loop in which data items are processed, function f operating on the data items, and function g operating on the data items output by function f and then function h operating on these items. These functions being performed n times in a row for values of i from 1 to n.

Thus, the control flow can be seen as following the solid arrows while data flow follows the dotted arrows. In order to try to parallelise this portion of the computer program it is analysed, either automatically or by a programmer and “decouple” indications are inserted into the data flow where it is seen as being desirable to split the portion into sections that are decoupled from each other and can thus, be executed on separate execution mechanisms. In this case, a decouple indication is provided between the data processing operations f and g. This can be seen as being equivalent to inserting a buffer in the data flow, as the two sections are decoupled by providing a data store between then so that the function f can produce its results which can then be accessed at a different time by function g.

FIG. 7
c, shows how the program is amended to enable this decoupling by the insertion of “put” and “get” instructions into the data stream. These result in the data being generated by the f function being put into a data store, from which it is retrieved by the get instruction to be processed by function g. This enables the program to be split into two sections as is shown in FIG. 7d. One section performs function f on the data for i=1 to n and puts it into a buffer data store. The other section then retrieves this data and performs functions g and h on it. Thus, by the provision of a data store the two sections of the program are in effect decoupled from each other and can be executed on separate executions mechanisms. This decoupling by the use of a specialised buffer and extra instructions to write and read data to it, are only required for systems having heterogeneous memory, whereby two execution mechanisms may not be able to access the same memory. If the memory is shared, then the data path between the two sections does not need a data copy but can simply be the provision of a data store identifier. Thus, if the program is being processed by a data processing apparatus having a number of different processors, the two sections can be processed in parallel which can improve the performance of the apparatus. Alternatively, one of the functions may be a function suitable for processing by an accelerator in which case it can be directed to an accelerator, while the other portion is processed by say, the CPU of the apparatus.

As can be seen from FIG. 7d, the splitting of the program results in the control code of the program being duplicated in both section, while the data processing code is different in each section.

It should be noted that the put and get operations used in FIG. 7c can be used in programs both for scalar and non-scalar values but they are inefficient for large (non-scalar) values as they require a memory copy. In operating systems, it is conventional to use “zero copy” interfaces for bulk data transfer: instead of generating data into one buffer and then copying the data to the final destination, the final destination is first determined and the data directly generated into the final destination. A different embodiment of the invention applies this idea to the channel interface, by replacing the simple ‘put’ operation with two functions: put_begin obtains the address of the next free buffer in the channel and put_end makes this buffer available to readers of the channel:

void* put_begin(channel *ch);

void put_end(channel *ch, void* buf);

Similarly, the get operation is split into a get_begin and get_end pair

void* get_begin(channel *ch);

void get_end(channel *ch, void* buf);

Using these operations, sequences of code such as:

int x[100];

generate(x);

put(ch,x);

Can be rewritten to this more efficient sequence:

int px=put_begin(ch);

generate(px);

put_end(ch,px);

And similarly, for get:

int x[100];

get(ch,x);

consume(x);

to this more efficient sequence:

int px=get_begin(ch);

consume(px);

get_end(ch,px);

The use of puts and gets to decouple threads can be further extended to use where communication between threads is cyclic. Cyclic thread dependencies can lead to “Loss of Decoupling”—that is, two threads may not run in parallel because of data dependencies between them and thus, in devices of the prior art decoupling is generally limited to acyclic thread dependencies.

1. A particularly common case of cyclic thread dependencies is code such as

y = 1;

while(1) {

x = f(y);

y = g(x);

}

Under conventional decoupling schemes, puts are always inserted after assignment to any data boundary variable. This would require both a put outside the loop and a put at the end of the loop:

y1 = 1;

put(ch,y1);

while(1) {

y2 = get(ch);

x = f(y2);

y3 = g(x);

put(ch,y3);

}

Conventional decoupling schemes only generate matched pairs of puts and gets (i.e., there is only one put on each channel and only one get on each channel) so they cannot generate such code.

Embodiments of the present invention use an alternative way of decoupling this code and generate:

y1 = 1;

while(1) {

put(ch,y1);

y2 = get(ch);

x = f(y2);

y1 = g(x);

}

This does have matched pairs of puts and gets but breaks the rule of always performing a put after any assignment to a variable.

FIGS. 8
a and 8b schematically illustrate the program code shown in FIG. 7. In this Figure a data store is provided to decouple functions f and g, but one is not provided between g and h. In this embodiment analysis of the program to decouple it is performed automatically and several potential sections are provided, in this case these are loops having functions f, g and h in them. The automatic analysis then checks that each loop can be executed separately and in this case identifies a missing data path between functions g and h. Thus, these two functions are remerged to provide two sections with a data path between.

FIG. 9 shows in more detail the data path between the two program sections. As can be seen in this figure, it is a data array that is transferred, that is the data from the whole loop that is transferred in a single transaction. This is clearly advantageous compared to transferring data for each pass in the loop. In particular, by parallelizing at a coarse granularity, the need for low latency, high throughput communication mechanisms such as those used in prior art finer granularity devices are reduced.

Furthermore, parallelizing at a significantly coarser granularity also allows the duplication of more control code between threads which reduces and simplifies inter-thread communication allowing the generation of distributed schedules. That is, we can distribute the control code across multiple processors both by putting each control thread on a different processor and by putting different parts of a single control thread onto different processors.

The transfer of data may be done by, writing the data to a particular buffer such as a FIFO. Alternatively it may simply be done by providing the other section of the program with information as to where the data has been stored.

The way of transferring the data depends on the system the program is executing on. In particular, if the architecture does not have shared memory, it is necessary to insert DMA copies from a buffer in one memory to a buffer in a different memory. This can lead to a lot of changes in the code: declaring both buffers, performing the copy, etc. In embodiments of the invention an analysis is performed to determine which buffers need to be replicated in multiple memory regions and to determine exactly which form of copy should be used. DMA copies are also inserted automatically subject to some heuristics when the benefit from having the programmer make the decision themselves is too small.

Systems with multiple local memories often have tight memory requirements which are exacerbated by allocating a copy of a buffer in multiple memories. The analysis takes account of this and seeks to reduce the memory requirement by overlapping buffers in a single memory when they are never simultaneously live.

It should be noted that although in some programs it may be appropriate to provide a FIFO type data store between the sections, in others it may be that the section requiring the data does not require it in a particular order, or it may not require all of the data. This can be provided for by varying the way the data is passed between the sections.

FIG. 10
a shows a simple computer program annotated according to an embodiment of the present invention. An analysis of this program is performed initially and parts of the program are identified by programmer annotation in this embodiment although it could be identified by some other analysis including static analysis, profile driven feedback, etc. The parts identified are as follows:

What can be regarded as the “decoupling scope”. This is a contiguous sequence of code that we wish to split into multiple threads.

The “replicatable objects”: that is variables and operations which it is acceptable to replicate. A simple rule of thumb is that scalar variables (i.e., not arrays) which are not used outside the scope, scalar operations which only depend on and only modify replicatable variables, and control flow operations should be replicated but more sophisticated policies are possible.

Ordering dependencies between different operations: if two function calls both modify a non-replicated variable, the order of those two function calls is preserved in the decoupled code. (Extensions to the basic algorithm allow this requirement to be relaxed in various ways.)

The “data boundaries” between threads: that is, the non-replicatable variables which will become FIFO channels. (The “copies” data annotation described above determines the number of entries in the FIFO.)

This degree of annotation is fine for examples but would be excessive in practice so most real embodiments would rely on tools to add the annotations automatically based on heuristics and/or analyses.

The particular decoupling algorithm used generates a maximal set of threads such that the following properties hold:

- All threads have the same control flow structure and may have copies of the replicatable variables and operations.
- Each non-replicatable operation is included in only one of the threads.
- Each non-replicatable variable must satisfy one of the following:
  - The only accesses to the variable in the original program are reads; or
  - All reads and writes to the variable are in a single thread; or
  - The variable was marked as a data boundary and all reads are in one thread and all writes are in another thread.
- If two operations have an ordering dependency between them which is not due to a read after write (RAW) dependency on a variable which has been marked as a data boundary, then the operations must be in the same thread.

FIG. 10
b shows the maximal set of threads for the program of FIG. 10a. One way to generate the set of threads shown in FIG. 10b is as follows:

- 1. For each non-replicatable operation, create a ‘protothread’ consisting of just that operation plus a copy of all the replicatable operations and variables. Each replicatable variable must be initialized at the start of each thread with the value of the original variable before entering the scope and one of the copies of each replicatable variable should be copied back into the master copy on leaving the scope. (Executing all these protothreads is highly unlikely to give the same answer as the original program, because it lacks the necessary synchronization between threads. This is fixed by the next steps.)
- 2. Repeatedly pick two threads and merge them into a single thread if any of the following problems exist:
  - a. One thread writes a non-replicatable variable which is accessed (read or written) by the other thread and the variable is not marked as a data boundary.
  - b. Two threads both write to a variable which is marked as a data boundary.
  - c. Two threads both read from a variable that is marked as a data boundary.
  - d. There is an ordering dependency between an operation in one thread and an operation in the other thread which is not a RAW dependency on a variable marked as a data boundary.
- 3. When no more threads can be merged, quit

Another way is to pick an operation, identify all the operations which must be in the same thread as that operation by repeatedly adding operations which would be merged (in step 2 above). Then pick the next operation not yet assigned to a thread and add all operations which must be in the same thread as that operation. Repeat until there are no more non-replicatable operations. It should be noted that this is just one possible way of tackling this problem: basically, we are forming equivalence classes based on a partial order and there are many other known ways to do this.

The above method splits a program into a number of sections which can be executed in parallel. There are many possible mechanisms that can be used to accomplish this task.

FIG. 11 schematically illustrates an asymmetric multiprocessor apparatus comprising a first execution mechanism 100 and a second execution mechanism 102. An asymmetric memory hierarchy within the system comprises a cache memory 104 connected to the first execution mechanism 100 and a shared memory 106 connected to both the first execution mechanism 100 and the second execution mechanism 102 via the cache memory 104. It will be appreciated that FIG. 11 illustrates a highly simplified system, but this is nevertheless asymmetric, contains an asymmetric memory hierarchy and would represent some level of difficulty in deciding which sections of a source program should execute on which execution mechanism 100, 102 and how the data should be partitioned between the different elements of the memory hierarchy 104, 106 (e.g. which data items used by the first processor 100 should be made cacheable and which non-cacheable).

FIG. 12 schematically illustrates an at least partial architectural description of the system of FIG. 11. This partial architectural description is in the style of the Spirit format and specifies which components are present and the interconnections between those components. It will be appreciated that in practice a Spirit architectural description will typically contain considerably more detail and information concerning the nature and interconnections of the various elements within the system. Nevertheless, this basic information as to which elements are present and how they are connected is used by a computer implemented method for transforming a source computer program into a transformed computer program for distributed execution on the system of FIG. 11.

FIG. 13 gives an example of a communication requirement which can be identified within a source computer program. This communication requirement is a Move instruction. This Move instruction is moving a variable A being manipulated within the first execution mechanism 100 (PE0) to the second execution mechanism 102 (PE1). Having identified this communication requirement, the architectural description of the system as given in FIG. 12 can be used to identify that an appropriate set of communication supporting operations need to be added to the code and include those illustrated, i.e. forming a MemoryBarrier on PE0, cleaning the variable A from the cache of PE0 and then loading the variable A from the memory 106 into the processor PE1. This is a considerably simplified example, but nevertheless illustrates the identification of a communication requirement followed by the associated communication support.

FIG. 14 schematically illustrates a section of source computer program including data placement tags and processing placement tags of the type described elsewhere herein. In particular, in respect of the data element char x[1000], a data placement tag is associated with the source computer program (in this particular example added to it) indicating that this data element should be stored within a memory MEM1. This information is used by the computer implemented method which maps portions of the source code to different execution mechanisms and compiles or configures those portions appropriately.

Also illustrated in FIG. 14 are two programming functions foo(x) and bar(x). It will be appreciated that these functions may represent complex sequences of instructions in their own right. The processing placement tags associated with each of these functions indicates where that function is to be executed. As an example, the function foo could be a general purpose control function and this is most appropriately performed using a general purpose processor PE0. Conversely the function bar may be a highly specialised FFT task or other specific function for which there is provided a specific accelerator in the form of the execution mechanism PE1 and accordingly it is appropriate to specify that this function should be executed on that particular execution mechanism.

The following describes language extensions/annotations, compilation tools, analysis tools, debug/profiling tools, runtime libraries and visualization tools to help programmers program complex multiprocessor systems. It is primarily aimed at programming complex SoCs which contain heterogeneous parallelism (CPUs, DEs, DSPs, programmable accelerators, fixed-function accelerators and DMA engines) and irregular memory hierarchies.

The compilation tools can take a program that is—either sequential or contains few threads and map it onto the available hardware, introducing parallelism in the process. When the program is executed, we can exploit the fact that we know mappings between the user's program and what is executing to efficiently present a debug and profile experience close to what the programmer expects while still giving the benefit of using the parallel hardware. We can also exploit the high level view of the overall system to test the system more thoroughly, or to abstract away details that do not matter for some views of the system.

This provides a way of providing a full view for SoC programming.
Overview

The task of programming a SoC is to map different parts of an application onto different parts of the hardware. In particular, blocks of code must be mapped onto processors, data engines, accelerators, etc. and data must be mapped onto various memories. In a heterogeneous system, we may need to write several versions of each kernel (each optimized for a different processor) and some blocks of code may be implemented by a fixed—function accelerator with the same semantics as the code.

The mapping process is both tedious and error-prone because the mappings must be consistent with each other and with the capabilities of the hardware. We reduce these problems using program analysis which:

- detect errors in the mapping
- infer what mappings would be legal
- choose legal mappings automatically subject to some heuristics
  
  The number of legal mappings is usually large but once the programmer has made a few choices, the number of legal options usually drops significantly so it is feasible to ask the programmer to make a few key choices and then have the tool fill in the less obvious choices automatically.
  
  Often the code needs minor changes to allow some mappings. In particular, if the architecture does not have shared memory, it is necessary to insert DMA copies from a buffer in one memory to a buffer in a different memory buffer. This leads to a lot of changes in the code: declaring both buffers, performing the copy, etc. Our compiler performs an analysis to determine which buffers need to be replicated in multiple memory regions and to determine exactly which form of copy should be used. It also inserts DMA copies automatically subject to some heuristics when the benefit from having the programmer make the decision themselves is too small.
  
  Systems with multiple local memories often have tight memory requirements which are exacerbated by allocating a copy of a buffer in multiple memories. Our compiler uses lifetime analysis and heuristics to reduce the memory requirement by overlapping buffers in a single memory when they are never simultaneously live.
  
  Programmable accelerators may have limited program memory so it is desirable to upload new code while old code is running. For correctness, we must guarantee that the new code is uploaded (and I-caches made consistent) before we start running it. Our compiler uses program analysis to check this and/or to schedule uploading of code at appropriate places.

For applications with highly variable load, it is desirable to have multiple mappings of an application and to switch dynamically between different mappings.

Some features of our approach are:

- Using an architecture description to derive the ‘rules’ for what code can execute where. In particular, we use the type of each processor and the memories attached to each processor.
- The use of program analysis together with the architecture description to detect inconsistent mappings.
- Using our ability to detect inconsistent mappings to narrow down the list of consistent mappings to reduce the number of (redundant) decisions that the programmer has to make.
- Selecting an appropriate copy of a buffer according to which processor is using it and inserting appropriate DMA copy operations.
- Use of lifetime analyses and heuristics to reduce memory usage due to having multiple copies of a buffer.
- Dynamic switching of mappings.

Annotations to Specify Mappings

To describe this idea further, we need some syntax for annotations. Here we provide one embodiment of annotations which provide the semantics we want.

In this document, all annotations take the form:

. . . @ {tag1=>value1, . . . tagm=>value}

Or, when there is just one tag and it is obvious,

. . . @ value

The primary annotations are on data and on code. If a tag is repeated, it indicates alternative mappings.

The tags associated with data include:

- {memory=>“bank3”} specifies which region of memory a variable is declared in.
- {copies=>2} specifies that a variable is double buffered
- {processor=>“P1 ”} specifies that a variable is in a region of memory accessible by processor P1.
  
  For example, the annotation:

int x[100] @ {memory=>“bank3”, copies=>2, memory=>“bank4”, copies=>1} indicates that there are 3 alternative mappings of the array x: two in memory bank3 and one in memory bank4.

The tags associated with code include:

- {processor=>“P1 ”} specifies which processor the code is to run on
- {priority=>5} specifies the priority with which that code should run relative to other code running on the same processor
- {atomic=>true} specifies that the code is to run without pre-emption.
- {runtime=>“<=10 ms”} specifies that the code must be able to run in less than 10 milliseconds on that processor. This is one method used to guide automatic system mapping.
  
  For example, the annotation:

{fir(x); fft(x,y);} @ {processor=>“P1”}

Specifies that processor P1 is to execute fft followed by P1. The semantics is similar to that of a synchronous remote procedure call: when control reaches this code, free variables are marshalled and sent to processor P1, processor P1 starts executing the code and the program continues when the code finishes executing.

It is not always desirable to have synchronous RPC behaviour. It is possible to implement asynchronous RPCs using this primitive either by executing mapped code in a separate thread or by splitting each call into two parts: one which signals the start and one which signals completion.

The tags associated with functions are:

- {cpu=>“AR1DE”} specifies that this version of an algorithm can be run on a processor/accelerator of type “AR1DE”
- {flags=>“-O3”} specifies compiler options that should be used when compiling this function
- {implements=>“fir”} specifies that this version of an algorithm can be used as a drop in replacement for another function in the system
  
  For example, the annotation:

Void copy_DMA(void* src, void* tgt, unsigned length) @ {cpu=>“PL081”, implements=>“copy”};

Specifies that this function runs on a PL081 accelerator (a DMA Primesys engine) and can be used whenever a call to “copy” is mapped to a PL081 accelerator.
Extracting Architectural Rules from the Architectural Description

There are a variety of languages for describing hardware architectures including the SPIRIT language and ARM SoCDesigner's internal language. While the languages differ in syntax, they share the property that we can extract information such as the following:

- The address mapping of each processor. That is, which elements of each each memory region and which peripheral device registers are accessed at each address in the address and I/O space. A special case of this is being able to detect that a component cannot address a particular memory region at all.
- The type of each component including any particular attributes such as cache size or type.
- That a processor's load-store unit, a bus, a combination of buses in parallel with each other, a memory controller or the address mapping makes it possible for accesses to two addresses that map to the same component or to different components from one processor to be seen in a different order by another processor. That is, the processors are not sequentially consistent with respect to some memory accesses.
- That a combination of load-store units, caches, buffers in buses, memory controllers, etc. makes it possible for writes by one processor to the same memory location to suffer from coherency problems wrt another processor for certain address ranges.
  
  Thus, from the architecture, we can detect both address maps which can be used to fill in fine details of the mapping process and we can detect problems such as connectivity, sequential consistency and incoherence that can affect the correctness of a mapping.

Detecting Errors in a System Mapping

Based on rules detected in an architectural description and/or rules from other sources, we can analyse both sequential and parallel programs to detect errors in the mapping. Some examples:

- If a piece of code is mapped to a processor P and that code reads or writes data mapped to a memory M and P cannot access M, then there is an error in the mapping.
- If two pieces of code mapped to processors P1 and P2 both access the same variable x (e.g. P1 writes to x and P2 reads from x), then any write by P1 that can be observed by a read by P2 must:
  - have some synchronization between P1 and P2
  - be coherent (e.g., there may need to be a cache flush by P1 before the synchronization and a cache invalidate by P2 after the synchronization)
  - be sequentially consistent (e.g., there may need to be a memory barrier by P1 before the synchronization and a memory barrier by P2 after the synchronization)
  - share memory (e.g., it may be necessary to insert one or more copy operations (by DMA engines or by other processors/accelerators) to transfer the data from one copy of x to the other.
- Synchronization and signalling can be checked
- Timing and bandwidth can be checked
- Processor capability can be checked: a DMA engine probably cannot play Pacman
- Processor speed can be checked: a processor may not be fast enough to meet certain deadlines.
- Etc.

Thus, we can check the mapping of a software system against the hardware system it is to run on based on a specification of the architecture or additional information obtained in different ways.
Filling in Details and Correcting Errors in a System Mapping

Having detected errors in a system mapping, there are a variety of responses. An error such as mapping a piece of code to a fixed-function accelerator that does not support that function should probably just be reported as an error that the programmer must fix. Errors such as omitting synchronization can sometimes be fixed by automatically inserting synchronization. Errors such as requiring more variables to a memory bank than will fit can be solved, to some extent, using overlay techniques. Errors such as mapping an overly large variable to a memory can be resolved using software managed paging though this may need hardware support or require that the kernel be compiled with software paging turned on (note: software paging is fairly unusual so we have to implement it before we can turn it on!). Errors such as omitting memory barriers, cache flush/invalidate operations or DMA transfers can always be fixed automatically though it can require heuristics to insert them efficiently and, in some cases, it is more appropriate to request that the programmer fix the problem themselves.

Overview

Given a program that has been mapped to the hardware, the precise way that the code is compiled depends on details of the hardware architecture. In particular, it depends on whether two communicating processors have a coherent and sequentially consistent view of a memory through which they are passing data.

Communication Glue Code

Our compiler uses information about the SoC architecture, extracted from the architecture description, to determine how to implement the communication requirements specified within the program. This enables it to generate the glue code necessary for communication to occur efficiently and correctly. This can include generation of memory barriers, cache maintenance operations, DMA transfers and synchronisation on different processing elements.

This automation reduces programming complexity, increases reliability and flexibility, and provides a useful mechanism for extended debugging options.
Communication Error Checking

Other manual and automatic factors may be used to influence the communication mechanism decisions. Errors and warnings within communication mappings can be found using information derived from the architecture description.

SUMMARY

Some features of our approach are:

- Detecting coherence and consistency problems of communication requirements from a hardware description.
- Automatically inserting memory barriers, cache maintenance, DMA transfers etc. to fix coherence/consistency problems into remote procedure call stubs (i.e., the “glue code”) based on above.
  
  We take the concept of Remote Procedure Calls (RPCs) which are familiar on fully programmable processors communicating over a network, and adapt and develop it for application in the context of a SoC: processors communicating over a bus with fixed function, programmable accelerators and data engines.
  
  Expressing execution of code on other processing elements or invocation of accelerators as RPCs gives a function based model for programmers, separating the function from the execution-mechanism. This enables greater flexibility and scope for automation and optimisation.

RPC Abstraction

An RPC abstraction can be expressed as functions mapped to particular execution mechanisms:

main( ) {

foo( );

foo( ) @ {processor => p2};

}

This provides a simple mechanism to express invocation of functions, and the associated resourcing, communication and synchronisation requirements.

Code can be translated to target the selected processing elements, providing the associated synchronisation and communication. For example, this could include checking the resource is free, configuring it, starting it and copying the results on completion. The compiler can select appropriate glue mechanisms based on the source and target of the function call. For example, an accelerator is likely to be invoked primarily by glue on a processor using a mechanism specific to the accelerator.

The glue code may be generated automatically based on a high level description of the accelerator or the programmer may write one or more pieces of glue by hand.

The choice of processor on which the operation runs can be determined statically or can be determined dynamically. For example, if there are two identical DMA engines, one might indicate that the operation can be mapped onto either engine depending on which is available first.

The compiler optimisations based on the desired RPC interface can range from a dynamically linked interface to inter-procedural specialisation of the particular RPC interface
RPC Semantics

RPC calls may be synchronous or asynchronous. Asynchronous calls naturally introduce parallelism, while synchronous calls are useful as a simpler function call model, and may be used in conjunction with fork-join parallelism. In fact, parallelism is not necessary for efficiency; a synchronous call alone can get the majority of the gain when targeting accelerators. Manually and automatically selecting between asynchronous and synchronous options can benefit debugging, tracing and optimisation.

RPC calls may be re-entrant or non-reentrant, and these decisions can be made implicitly, explicitly or through program analysis to provide benefit such as optimisation where appropriate.
RPC Debugging

This mechanism enables a particular function to have a number of different execution targets within a program, but each of those targets can be associated back to the original function; debugging and trace can exploit this information. This enables a user to set a breakpoint on a particular function, and the debug and trace mechanisms be arranged such that it can be caught wherever it executes, or on a restricted subset (e.g. a particular processing element).

The details of the RPC interface implementation can be abstracted away in some debugging views.
SUMMARY

Some features of our approach are:

- Using an RPC-like approach for mapping functions on to programmable and fixed function accelerators, including multiple variants.
- Providing mechanisms for directing mapping and generation of the marshalling and synchronisation to achieve it.
- Optimising the RPC code based-on inter-procedural and program analysis.
- Providing-debug functionality based on information from the RPC abstraction and the final function implementations.

Overview

Increasingly, applications are being built using libraries which define datatypes and a set of operations on those types. The datatypes are often bulk datastructures such as arrays of data, multimedia data, signal processing data, network packets, etc. and the operations may be executed with some degree of parallelism on a coprocessor, DSP processor, accelerator, etc. It is therefore possible to view programs as a series of often quite coarse-grained operations applied to quite large data structures instead of the conventional view of a program as a sequence of ‘scalar’ operations (like ‘32 bit add’) applied to ‘scalar’ values like 32-bit integers or the small sets of values found in SIMD within a register (SWAR) processing such as that found in NEON. It is also advantageous to do so because this coarse-grained view can be a good match for accelerators found in modern SoCs.

We observe that with some non-trivial adaptation and some additional observations, optimization techniques known to work on fine-grained operations and data can be adapted to operate on coarse-grained operations and data.
Our compiler understands the semantics associated with the data structures and their use within the system, and can manipulate them and the program to perform transformations and optimisations to enable and optimise execution of the program.
Conventional Analyses and their Extension

Most optimizing compilers perform a dataflow analysis prior to optimization. For example, section 10.5 of Aho Sethi and Ullman's ‘Compilers: Principles Techniques and Tools’, published by Addison Wesley, 1986, ISBN: 0-201-10194-7 describes dataflow analysis. The dataflow analysis is restricted to scalar values: those that fit in a single CPU register. Two parts of a dataflow analysis are:

- identifying the dataflow through individual operations
- combining the dataflow analysis with a control-flow analysis to determine the dataflow from one program point to another.
  
  In order to use dataflow analysis techniques with coarse-grained dataflow, we modify the first part so that instead of identifying the effect of a single instruction on a single element, we identify the effect of a coarse-grained operation (e.g., a function call or coprocessor invocation) on an entire data structure in terms of whether the operation is a ‘use’, a ‘def’ or a ‘kill’ of the value in a data structure. Care must be taken if an operation modifies only half of an array since the operation does not completely kill the value of the array.
  
  For operations implemented in hardware or in software, this might be generated automatically from a precise description of the operation (including the implementation of the operation) or it might be generated from an approximate description of the main effects of the operation or it might be provided as a direct annotation.
  
  In particular, for software, these coarse-grained operations often consist of a simple combination of nested loops and we can analyze the code to show that the operation writes to an entire array and therefore ‘kills’ the old value in the array. In scalar analysis, this is trivial since any write necessarily kills the entire old value.

The following sections identify some of the uses of coarse-grained dataflow analysis
Multiple Versions of the same Buffer

Especially when writing parallel programs or when using I/O devices and when dealing with complex memory hierarchies, it is necessary to allocate multiple identically sized buffers and copy between the different buffers (or use memory remapping hardware to achieve the effect of a copy). We propose that in many cases these multiple buffers can be viewed as alternative versions of a single, logical variable. It is possible to detect this situation in a program with multiple buffers, or the programmer can identify the situation. One way the programmer can identify the situation is to declare a single variable and then use annotations to specify that the variable lives in multiple places or the programmer could declare multiple variables and use annotations to specify that they are the same logical variable. However the different buffers are identified as being one logical variable, the advantages that can be obtained include:

- more intelligent buffer allocation
- detecting errors where one version is updated and that change is not propagated to other version before it is used
- debug, trace and profile tools can treat a variable as one logical entity so that, for example, if programmer sets watchpoint on x then tools watch for changes on any version of x. Likewise, if compiler has put x and y in the same memory location (following liveness analysis), then the programmer will only be informed about a write to x when that memory location is being used to store x, not when it is being used to store y. When doing this, you might well want to omit writes to a variable which exist only to preserve the multi-version illusion. For example, if one accelerator writes to version 1, then a dma copies version 1 to version 2, then another accelerator modifies the variable, then the programmer will often not be interested in the dma copy.
  
  We note that compilers do something similar for scalar variables: the value of a scalar variable ‘x’ might sometimes live on the stack, sometimes in register 3, sometimes in register 6, etc. and the compiler keeps track of which copies currently contain the live value.

Allocation

By performing a liveness analysis of the data structures, the compiler can provide improved memory allocation through memory reuse because it can identify opportunities to place two different variables in the same memory location. Indeed, one can use many algorithms normally used for register allocation (where the registers contain scalar values) to perform allocation of data structures. One modification required is that one must handle the varying size of buffers whereas, typically, all scalar registers are the same size.

Scheduling

One thing that can increase memory use is having many variables simultaneously live. It has been known for a long time that you can reduce the number of scalar registers required by a piece of code by reordering the scalar operations so that less variables are simultaneously live.

Using a coarse-grained dataflow analysis, one can identify the lifetime of each coarse-grained data structure and then reorder code to reduce the number of simultaneously live variables. One can even choose to recalculate the value of some data structure because it is cheaper to recalculate it than to remember its value.

When parallelising programs, one can also deliberately choose to restrain the degree of parallelism to reduce the number of simultaneously live values. Various ways to restrain the parallelism exist: forcing two operations into the same thread, using mutexes/semaphores to block one thread if another is using a lot of resource, tweaking priorities or other scheduler parameters.

If a processor/accelerator has a limited amount of available memory, performing a context switch on that processor can be challenging. Context switching memory-allocated variables used by that processor solves the problem.

Optimisation

Compiler books list many other standard transformations that can be performed to scalar code. Some of the mapping and optimisation techniques that can be applied at the coarse-grain we discuss include value splitting, spilling, coalescing, dead variable removal, recomputation, loop hoisting and CSE.

Data structures will be passed as arguments, possibly as part of an ABI. Optimisations such as specialisation and not conforming to the ABI when it is not exposed can be applied.
Multigranularity Operation

In some cases, one would want to view a complex datastructure at multiple granularities. For example, given a buffer of complex values, one might wish to reason about dataflow affecting all real values in the buffer, dataflow affecting all imaginary values or dataflow involving the whole buffer. (More complex examples exist)

Debugging

When debugging, it is possible for the data structure to live in a number of different places throughout the program. We can provide a single debug view of all copies, and watch a value wherever it is throughout the lifetime of a program, optionally omitting omit certain accesses such as DMAs.

The same is possible for tracing data structures within the system.
Zero Copying

Using this coarse-grained view, one can achieve zero copy optimization of a sequence of code like this:

- int x[100];
- generate(&x); // writes to x
- put(channel,&x)
  
  by inlining the definition of put to get:
- int x[100];
- generate(&x); // writes to x
- int *px=put_begin(channel);
- copy(px,&x);
- put_end(channel,px);
  
  then reordering the code a little:
- int *pX=put_begin(channel);
- int x[100];
- generate(&x); // writes to x copy(px,&x);
- put_end(channel,px);
  
  and optimizing the memory allocation and copy:
- int *px=putbegin(channel);
- generate(px); // writes to *px
- put_end(channel,px);

Trace

Most of this section is about coarse-grained data structure but some benefits from identifying coarse-grained operations come when we are generating trace. Instead of tracing every scalar operation that is used inside a coarse-grained operation, we can instead just trace the start and stop of the operation. This can also be used for cross-triggering the start/stop of recording other information through trace.

Likewise, instead of tracing the input to/output from the whole sequence of scalar operations, we can trace just the values at the start/end of the operation.
Validating Programmer Assertions

If we rely on programmer assertions, documentation, etc. in performing our dataflow analysis, it is possible that an error in the assertions will lead to an error in the analysis or transformations performed. To guard against these we can often use hardware or software check mechanisms. For example, if we believe that a function should be read but not written by a given function, then we can perform a compile-time analysis to verify it ahead of time or we can program an MMU or MPU to watch for writes to that range of addresses or we can insert instrumentation to check for such errors. We can also perform a ‘lint’ check which looks for things which may be wrong even if it cannot prove that they are wrong. Indeed, one kind of warning is that the program is too complex for automatica analysis to prove that it is correct.

SUMMARY

Some of the features of our approach are:

- Using a register like (aka scalar-like) approach to data structure semantics within the system
- Using liveness analysis to influence memory allocation, parallelism and scheduling decisions.
- Applying register optimisations found in compiler to data structures within a program.
- Providing debugging and tracing of variables as a single view

Overview

Given a program that uses some accelerators, it is possible to make it run faster by executing different parts in parallel with one another. Many methods for parallelizing programs exist but many of them require homogeneous hardware to work and/or require very low cost, low latency communication mechanisms to obtain any benefit. Our compiler uses programmer annotations (many/all of which can be inserted automatically) to split the code that invokes the accelerators (‘the control code’) into a number of parallel “threads” which communicate infrequently. Parallelizing the control code is advantage because it allows tasks on independent accelerators to run concurrently.

Our compiler supports a variety of code generation strategies which allow the parallelized control code to run on a control processor in a real time operating system, in interrupt handlers or in a polling loop (using ‘wait for event’ if available to reduce power) and it also supports distributed scheduling where some control code runs on one or more control processors, some control code runs on programmable accelerators, some simple parts of the code are implemented using conventional task-chaining hardware mechanisms. It is also possible to design special ‘scheduler devices’ which could execute some parts of the control code. The advantage of not running all the control code on the control processor is that it can greatly decrease the load on the control processor.

Other parallelising methods may be used in conjunction with the other aspects of this compiler.

Some of the features of our approach are:

- By applying decoupled software pipelining to the task of parallelizing the control code in a system that uses heterogeneous accelerators, we significantly extend the reach of decoupled software pipelining and by working on coarser grained units of parallelism, we avoid the need to add hardware to support high frequency streaming.
- By parallelizing at a significantly coarser granularity, we avoid the need for low latency, high throughput communication mechanisms used in prior art.
- Parallelizing at a significantly coarser granularity also allows us to duplicate more control code between threads which reduces and simplifies inter-thread communication which allows us to generate distributed schedules. That is, we can distribute the control code across-multiple processors both by putting each control thread on a different processor and by putting different parts of a single control thread onto different processors.
- By optionally allowing the programmer more control over the communication between threads, we are able to overcome the restriction of decoupled software pipelining to acyclic ‘pipelines’.
- The wide range of backends including distributed scheduling and use of hardware support for scheduling.
- Our decoupling algorithm is applied at the source code level whereas existing decoupling algorithms are applied at the assembly code level after instruction scheduling.
  
  Some of the recent known discussions on decoupled software pipelining are:
- Decoupled Software Pipelining: http://liberty.princeton.edu/Research/DSWP/ http://liberty.princeton.edu/Publications/index.php?abs-1&setselect=pact04_dswp http://Iiberty.cs.princeton.edu/Publications/index.php?abs=1&setselect=micro38_d swp
- Automatically partitioning packet processing applications for pipelined architectures, PLDI 2005, ACM http://portal.acm.org/citation.cfm?id=1065010.1065039

A Basic Decoupling Algorithm

The basic decoupling algorithm splits a block of code into a number of threads that pass data between each other via FIFO channels. The algorithm requires us to identify (by programmer annotation or by some other analysis including static analysis, profile driven feedback, etc.) the following parts of the program:

- The “decoupling scope”: that is a contiguous sequence of code that we wish to split into multiple threads. Some ways this can be done are by marking a compound statement, or we can insert a ‘barrier’ annotation that indicates that some parallelism ends/starts here.
- The “replicatable objects”: that is variables and operations which it is acceptable to replicate. A simple rule of thumb is that scalar variables (i.e., not arrays) which are not used outside the scope, scalar operations which only depend on and only modify replicatable variables, and control flow operations should be replicated but more sophisticated policies are possible.
- Ordering dependencies between different operations: if two function calls both modify a non-replicated variable, the order of those two function calls is preserved in the decoupled code. (Extensions to the basic algorithm allow this requirement to be relaxed in various ways.)
- The “data boundaries” between threads: that is, the non-replicatable variables which will become FIFO channels. (The “copies” data annotation described above determines the number of entries in the FIFO.)

(Identifying replicatable objects and data boundaries are two of the features of our decoupling algorithm.)

If we use annotations on the program to identify these program parts, a simple program might look like this:

void main( ) {

int i;

for(i=0; i<10; ++i) {

int x[100] @ {copies=2, replicatable=false; boundary=true} ;

produce(x) @ {replicatable=false, writes_to=[x]};

DECOUPLE(x);

consume(x) @ {replicatable=false, reads_from=[x]};

}

}

At a high level, the algorithm splits the operations in the scope into a number of threads whose execution will produce the same result as the original program under any scheduling policy that respects the FIFO access ordering of the channels used to communicate between threads.

The particular decoupling algorithm we use generates a maximal set of threads such that the following properties hold:

- All threads have the same control flow structure and may have copies of the replicatable variables and operations.
- Each non-replicatable operation is included in only one of the threads.
- Each non-replicatable variable must satisfy one of the following:
  - The only accesses to the variable in the original program are reads; or
  - All reads and writes to the variable are in a single thread; or
  - The variable was marked as a data boundary and all reads are in one thread and all writes are in another thread.
- If two operations have an ordering dependency between them which is not due to a read after write (RAW) dependency on a variable which has been marked as a data boundary, then the operations must be in the same thread.
  
  For the example program above, the maximal set of threads is:

void main( ) {

int x[100] @ {copies=2};

channel c @ {buffers=x};

parallel sections{

section{

int i;

for(i=0; i<10; ++i) {

int x1[100];

produce(x1);

put(c,x1);

}

}

section{

int i;

for(i=0; i<10; ++i) {

int x2[100];

get(c,x2);

consume(x2);

}

}

}

One way to generate this set of threads is as follows:

- 4. For each non-replicatable operation, create a ‘protothread’ consisting of just that operation plus a copy of all the replicatable operations and variables. Each replicatable variable must be initialized at the start of each thread with the value of the original variable before entering the scope and one of the copies of each replicatable variable should be copied back into the master copy on leaving the scope. (Executing all these protothreads is highly unlikely to give the same answer as the original program, because it lacks the necessary synchronization between threads. This is fixed by the next steps.)
- 5. Repeatedly pick two threads and merge them into a single thread if any of the following problems exist:
  - a. One thread writes a non-replicatable variable which is accessed (read or written) by the other thread and the variable is not marked as a data boundary.
  - b. Two threads both write to a variable which is marked as a data boundary.
  - c. Two threads both read from a variable that is marked as a data boundary.
  - d. There is an ordering dependency between an operation in one thread and an operation in the other thread which is not a RAW dependency on a variable marked as a data boundary.
- 6. When no more threads can be merged, quit
  
  Another way if to pick an operation, identify all the operations which must be in the same thread as that operation by repeatedly adding operations which would be merged (in step 2 above). Then pick the next operation not yet assigned to a thread and add all operations which must be in the same thread as that operation. Repeat until there are no more non-replicatable operations. (There are lots of other ways of tackling this problem: basically, we are forming equivalence classes based on a partial order and there are many known ways to do that.)

Note that doing dataflow analysis on arrays one must distinguish defs which are also kills (i.e., the entire value of a variable is overwritten by an operation) and that requires a more advanced analysis than is normally used.
Decoupling Extensions
There are a number of extensions to this model
Range Splitting Preprocessing

It is conventional to use dataflow analysis to determine the live ranges of a scalar variable and then replace the variable with multiple copies of the variable: one for each live range. We use the same analysis techniques to determine the live range of arrays and split their live ranges in the same way. This has the benefit of increasing the precision of later analyses which can enable more threads to be generated. On some compilers it also has the undesirable effect of increasing memory usage which can be mitigated by later merging these copies if they end up in the same thread and by being selective about splitting live ranges where the additional decoupling has little overall-effect on performance.

Zero Copy Optimizations

The put and get operations used when decoupling can be used both for scalar and non-scalar values. (i.e., both for individual values (scalars) and arrays of values (non-scalars) but they are inefficient for large scalar values because they require a memory copy. Therefore, for coarse-grained decoupling, it is desirable to use an optimized mechanism to pass data between threads.

In operating systems, it is conventional to use “zero copy” interfaces for bulk data transfer: instead of generating data into one buffer and then copying the data to the final destination, we first determine the final destination and generate the data directly into the final destination. Applying this idea to the channel interface, we can replace the simple ‘put’ operation with two functions: put_begin obtains the address of the next free buffer in the channel and put_end makes this buffer available to readers of the channel:

- Void* put_begin(channel *ch);
- Void put_end(channel *ch, void* buf);

Similarily, the get operation is split into a get_begin and get_end pair

- Void* get_begin(channel *ch);
- Void get_end(channel *ch, void* buf);
  
  Using these operations, we can often rewrite sequences of code such as:
- Int x[100];
- Generate(x);
- Put(ch,x);
  
  to this more efficient sequence:
- Int *px=put_begin(ch);
- Generate(px);
- Put_end(ch,px);
  
  And similarity, for get:
- Int-x[100];
- Get(ch,x);
- Consume(x);
  
  to this more efficient sequence:
- Int *px=get_begin(ch);
- Consume(px);
- get_end(ch,px);

Note that doing zero copy correctly requires us to take lifetime of variables into account.

We can do that using queues with multiple readers, queues with intermediate r/w points, reference counts or by restricting the decoupling (all readers must be in same thread and . . . ) to make lifetime trivial to track. This can be done by generating custom queue structures to match the code or custom queues can be built out of a small set of primitives.

Dead Code and Data Elimination
This section illustrates both how to get better results and also that we may not get exactly the same control structure but that they are very similar.
Allowing Cyclic Thread Dependencies

Prior art on decoupling restricts the use of decoupling to cases where the communication between the different threads is acyclic. There are two reasons why prior art has done this:

- 2. Cyclic thread dependencies can lead to “Loss of Decoupling”—that is, two threads may not run in parallel because of data dependencies between them.
- 3. A particularity common case of cyclic thread dependencies is code such as

y = 1;

while(1) {

x = f(y);

y = g(x);

}

- Under existing decoupling schemes, puts are always inserted after assignment to any data boundary variable. This would require both a put outside the loop and a put at the end of the loop:

y1 = 1;

put(ch,y1);

while(1) {

y2 = get(ch);

x = f(y2);

y3 = g(x);

put(ch,y3);

}

- Existing decoupling schemes only generate matched pairs of puts and gets (i.e., there is only one put on each channel and only one get on each channel so they cannot generate such code An alternative way of decoupling this code is to generate:

y1 = 1;

while(1) {

put(ch,y1);

y2 = get(ch);

x = f(y2);

y1 = g(x);

}

- This does have matched pairs of puts and gets but breaks the rule of always performing a put after any assignment to a variable so it is also not generated by existing decoupling techniques.

Exposing Channels to the Programmer

It is possible to modify the decoupling algorithm to allow the programmer to insert-puts and gets (or put_begin/end, get_begin/end pairs) themselves. The modified decoupling algorithm treats the puts and gets in much the same way that the standard algorithm treats data boundaries. Specifically, it constructs the maximal set of threads such that:

- Almost all the same conditions as for standard algorithm go here
- All puts to a channel are in the same thread
- All gets to a channel are in the same thread
  
  For example, given this program:

channel ch1;

put(ch1,0);

for(int i=0; i<N); ++i) {

int x = f( );

put(ch1,x);

int y = g(get(ch1));

DECOUPLE(y);

h(x,y);

}

The modified decoupling algorithm will produce:

channel ch1, ch2;

put(ch1,0);

parallel sections{

section{

for(int i=0; i<10; ++i) {

x = f();

put(ch1,x);

int y = get(ch2);

h(x,y);

}

}

section{

for(int i=0; i<10; ++i) {

int y = g(get(ch1));

put(ch2,y);

}

}

This extension of decoupling is useful for creating additional parallelism because it allows f and g to be called in parallel.

Writing code using explicit puts can also be performed as a preprocessing step. For example, we could transform:

for(i=0; i<N; ++i) {

x = f(i);

y = g(i,x);

h(i,x,y);

}

To the following equivalent code:

x = f(0);

for(i=0: i<N; ++i) {

y = g(i,x);

h(i,x,y);

if (i+1<N) x = f(i+1);

}

Which, when decoupled gives very similar code to the above.
(There are numerous variations on this transformation including computing f(i+1) unconditionally, peeling the last iteration of the loop, etc.)
Alternatives to FIFO Channels

A First-In First-Out (FIFO) channel preserves the order of values that pass through it: the first value inserted is the first value extracted, the second value inserted is the second value extracted, etc. Other kinds of channel are possible including:

- a “stack” which has Last in First out (LIFO) semantics. Amongst other advantages, stacks can be simpler to implement
- a priority queue where entries are prioritized by the writer or according to some property of the entry and the reader always received the highest priority entry in the queue.
- a merging queue where a new value is not inserted if it matches the value at the back of the queue or as a variant, if it matches any value in the queue. Omitting duplicate values which may help reduce duplicated work
- a channel which only tracks the last value written to the queue. That is, the queue logically contains only the most recently written entry. This is useful if the value being passed is time-dependent (e.g., current temperature) and it is desirable to always use the most recent value. Note that with fine-grained decoupling the amount of time between generation of the value and its consumption is usually small so being up to date is not a problem; whereas in coarse-grained decoupling, a lot of time may pass between generation and consumption and the data could easily be out of date if passed using a FIFO structure.
- A channel which communicates with a hardware device. For example, a DMA device may communicate with a CPU using a memory mapped doubly-linked list of queue entries which identify buffers to be copied or a temperature sensor may communicate with a CPU using a device register which contains the current temperature.

Using most of these alternative channels has an affect on program meaning so we either have to perform an analysis before using a different kind of channel or the programmer can indicate that a different choice is appropriate/allowed.
Using Locks

In parallel programming, it is often necessary for one thread to need exclusive access to some resource while it is using that resource to avoid a class of timing dependent behaviour known as a “race condition” or just a “race”. The regions of exclusive access are known as “critical sections” and are often clearly marked in a program. Exclusive access can be arranged in several ways. For example, one may ‘acquire’ (aka ‘lock) a ‘lock’ (aka ‘mutex’) before starting to access the resource and ‘release’ (aka ‘unlock’) the lock after using the resource. Exclusive access may also be arranged by disabling pre-emption (such as interrupts) while in a critical section (i.e., a section in which exclusive access is required). In some circumstances, one might also use a ‘lock free’ mechanism where multiple users may use a resource but at some point during use (in particular, at the end), they will detect the conflict, clean up and retry. Some examples of wanting exclusive access include having exclusive access to a hardware accelerator, exclusive access to a block of memory or exclusive access to an input/output device. Note that in these cases, it is usually not necessary to preserve the order of accesses to the resource.

The basic decoupling algorithm avoids introducing race conditions by preserving all ordering dependencies on statements that access non-replicated resources. Where locks have been inserted into the program, the basic decoupling algorithm is modified as follows:

- The ordering dependencies on operations which use shared resources can be relaxed. This requires programmer annotation and/or program analysis which, for each operation which may be reordered, identifies:
  - Which other operations it can be reordered relative to
  - Which operations can simultaneously access the same resource (i.e., without requiring exclusive access)
  - Which critical section each operation occurs in.
- For example, one might identify a hardware device as a resource, then indicate which operations read from the resource (and so can be executed in parallel with each other) and which operations modify the resource (and so must have exclusive access to the resource).
- For simplicity, one might identify all operations inside a critical section as having an ordering dependency between them though one can sometimes relax this if the entire critical section lies inside the scope of decoupling.
- One might determine which critical section each operation occurs in using an analysis which conservatively approximates the set of locks held at all points in the program.

Multithreaded Input

Decoupling can be applied to any sequential section of a parallel program. If the section communicates with the parallel program, we must determine any ordering dependencies that apply to operations within the section (a safe default is that the order of such operations should be preserved). What I'm saying here is that one of the nice properties of decoupling is that it interacts well with other forms of paralellization including manual parallelization.

Decoupling Backends

The decoupling algorithm generates sections of code that are suitable for execution on separate processors but can be executed on a variety of different execution engines by modifying the “back end” of the compiler. That is, by applying a further transformation to the code after decoupling to better match the hardware or the context we wish it to run in.

Multiprocessor and Multithreaded Processor Backends

The Most Straightforward Execution Model is to Execute each Separate Section in the Decoupled Program on a Separate Processor or, on a Processor that Supports Multiple Hardware Contexts (i.e., Threads), to Execute each Separate Section on a Separate Thread.

Since most programs have at least one sequential section before the separate sections start (e.g., there may be a sequential section to allocate and initialize channels), execution will typically start on one processor which will later synchronize with the other processors/threads to start parallel sections on them.

Using Accelerators

In the context of an embedded system and, especially, a System on Chip (SoC), some of the data processing may be performed by separate processors such as general purpose processors, digital signal processors (DSPs), graphics processing units (GPUs), direct memory access (DMA) units, data engines, programmable accelerators or fixed-function accelerators. This data processing can be modelled as a synchronous remote procedure call. For example, a memory copy operation on a DMA engine can be modelled as a function call to perform a memory copy. When such an operation executes, the thread will typically:

- acquire a lock to ensure it has exclusive access to the DMA engine
- configure the DMA engine with the source and destination addresses and the data size
- start the DMA engine to initiate the copy
- wait for the DMA engine to complete the copy which will be detected either by an interrupt to a control processor or by polling
- copy out any result from the copy (such as a status value)
- release the lock on the accelerator
  
  This mode of execution can be especially effective because one ‘control processor’ can keep a number of accelerator's busy with the control processor possibly doing little more than deciding which accelerator to start next and on what data. This mode of execution can be usefully combined with all of the following forms of execution.

RTOS Backend

Instead of a multiprocessor or multithreaded processor, one can use a thread library, operating system (OS) or real time operating system (RTOS) running on one or more processors to execute the threads introduced by decoupling. This is especially effective when combined with the use of accelerators because running an RTOS does not provide parallelism and hence does not increase performance but using accelerators does provide parallelism and can therefore increase performance.

Transforming to Event-Based Execution

Instead of executing threads directly using a thread library, OS or RTOS, one can transform threads into an ‘event-based’ form which can execute more efficiently than threads. The methods can be briefly summarized as follows:

- Transformations to data representation.
- The usual representation of threads allocates thread-local variables on a stack and requires one stack per thread. The overhead of managing this stack and some of the space overhead of stacks can be reduced by using a different allocation policy for thread-local variables based on how many copies of the variable can be live at once and on the lifetime of the variables.
- If only one copy of each variable can be live at once (e.g., if the functions are not required to be re-entrant), then all variables can be allocated statically (i.e., not on a stack or heap).
- If multiple copies of a variable can be live at once (e.g., if more than once instance of a thread can be live at once), the variables can be allocated on the heap.
- Transformations to context-switch mechanism
- When one processor executes more threads than the processor supports, the processor must sometimes switch from executing one thread to executing another thread. This is known as a ‘context switch’. The usual context mechanism used by threads is to save the values of all registers on the stack or in a reserved area of memory called the “thread control block”, then load all the registers with values from a different thread control block and restart the thread. The advantage of this approach is that a context switch can be performed at almost any point during execution so any code can be made multithreaded just by using a suitable thread library, OS or RTOS.
- An alternative mechanism for context switching is to transform each thread to contain explicit context switch points where the thread saves its current context in a thread control block and returns to the scheduler which selects a new thread to run and starts it. The advantages of this approach are that thread control blocks can be made significantly smaller. If all context switches occur in the top-level function and all thread-local variables can be statically allocated, it is possible to completely eliminate the stack so that the entire context of a thread can be reduced to just the program counter value which makes context switches very cheap and makes thread control blocks extremely small.
- A further advantage of performing context switches only at explicit context switch points is that it is easier and faster to ensure that a resource shared between multiple threads is accessed exclusively by at most one thread at a time because, in many cases, it is possible to arrange that pre-emption only happens when the shared resource is not being used by the current thread.
  
  Together, these transformations can be viewed as a way of transforming a thread into a state machine with each context switch point representing a state and the code that continues execution from each context switch point viewed as a transition function to determine the next state. Execution of transformed threads can be viewed as having been transformed to an event-based model where all execution occurs in response to external events such as responses from input/output devices or from accelerators. It is not necessary to transform all threads: event-based execution can coexist with threaded execution.

Interrupt-Driven Execution

Transforming threads as described above to allow event-based execution is a good match for applications that use accelerators that signal task completion via interrupts. On receiving an interrupt signalling task completion the following steps occur:

- the state of the associated accelerator is updated
- all threads that could be blocked waiting for that task to complete or for that accelerator to become available are executed. This may lead to further threads becoming unblocked.
- When there are no runnable threads left, the interrupt handler completes

Polling-Based Execution

Transforming threads as described above is also a good match for polling-based execution where the control processor tests for completion of tasks on a set of accelerators by reading a status register associated with each accelerator. This is essentially the same as interrupt-driven execution except that the state of the accelerators is updated by polling and the polling loop executes until all threads complete execution.

Distributed Scheduling

Distributed scheduling can be done in various ways. Some part of a program may be simple enough that it can be implemented using a simple state machine which schedules one invocation of an accelerator after completion of another accelerator. Or, a control processor can hand over execution of a section within a thread to another processor. In both cases, this can be viewed as a RPC like mechanism (“{foo( ); bar( )@P0;}@P1”). In the first case, one way to implement it is to first transform the thread to event-based form and then opportunistically spot that a sequence of system states can be mapped onto a simple state machine and/or you may perform transformations to make it map better.

Non-Work-Conserving Schedulers and Priorities/Deadlines

Two claims in this section: 1) using a priority mechanism and 2) using a non-work-conserving scheduler in the context of decoupling

If a system has to meet a set of deadlines and the threads within the system share resources such as processors, it is common to use a priority mechanism to select which thread to run next. These priorities might be static or they may depend on dynamic properties such as the time until the next deadline or how full/empty input and output queues are.

In a multiprocessor system, using a priority mechanism can be problematic because at the instant that one task completes, the set of tasks available to run next is too small to make a meaningful choice and better schedules occur if one waits a small period of time before making a choice. Such schedulers are known as non-work-conserving schedulers.

Overview

A long-standing problem of parallelizing compilers is that it is hard to relate the view of execution seen by debug mechanisms to the view of execution the programmer expects from the original sequential program. Our tools can take an execution trace obtained from running a program on parallel hardware and reorder it to obtain a sequential trace that matches the original program. This is especially applicable to but not limited to the coarse-grained nature of our parallelization method.

To achieve complete reconstruction, it helps if the parallelizing compiler inserts hints in the code that make it easier to match up corresponding parts of the program. In the absence of explicit hints, it may be possible to obtain full reconstruction using debug information to match parts of the program.

When there are no explicit hints or debug information, partial reconstruction can be achieved by using points in the program that synchronize with each other to guide the matching process. The resulting trace will not be sequential but will be easier to understand. A useful application is to make it simpler to understand a trace of a program written using an event-based programming style (e.g., a GUI, interrupt handlers, device drivers, etc.)

Partial reconstruction could also be used to simplify parallel programs running on systems that use release consistency. Such programs must use explicit memory barriers at all synchronization points so it will be possible to simplify traces to reduce the degree of parallelism the programmer must consider.

One simple case of this is reconstructing a ‘message passing’ view of bus traffic.

HP has been looking at using trace to enable performance debugging of distributed protocols. Their focus is on data mining and performance not reconstructing a sequential trace. http://portal.acm.org/citation.cfm?id=945445.945454&d1=portal&d1=ACM&type=series&idx=945445&part=Proceedings&WantType=Proceedings&title=ACM %20Symposium %20 on %20Operating %20Systems %20Principles&CFID=111111111&CFTOKEN=2222222

Partial Reconstruction Based on Observed Dataflow

Suppose we can identify sections of the system execution and we have a trace which lets us identify when each section was running and we have a trace of the memory accesses they performed or, from knowing properties of some of the sections, we know what memory accesses they would perform without needing a trace. The sections we can identify might be:

- function calls
- remote procedure calls
- execution of a fixed-function accelerator such as a DMA transfer
- message passing

We can summarize the memory accesses of each section in terms of the input data and the output data (what addresses were accessed and, perhaps, what values were read or written).

Given a sequence of traces of sections, we can construct a dynamic dataflow graph where each section is a node in a directed graph and there is an edge from a node M to a node N if the section corresponding to M writes to an address x and the section corresponding to N reads from address x and, in the original trace, no write to x happens between M's write to x and N's read from x.

This directed dataflow graph shows how different sections communicate with each other and can be used for a variety of purposes:

- identify potential parallelism
- identify timing-sensitive behaviour such as race conditions (when combined with a trace of synchronizations between parallel threads): if M writes to x and N reads from x and there is no chain of synchronizations from M to N to ensure that N cannot read from x before M does the read, there is a potential problem
- identify redundant memory writes (if a value is overwritten before it has been read)
- provides a simple way to show programmers what is happening in a complex, possibly parallel, system
- can be analyzed to determine the time between when data is being generated and when it is consumed. If the time is long it might suggest that memory requirements could be reduced by calculating data nearer the time or, in a parallel or concurrent system that the generating task can be executed later.
- can be analyzed to identify number and identity of consumers of data: it is often possible to manage memory more efficiently or generate data more efficiently if we know what it is being used for, when it is being used, etc.

Many other uses exist.
Full Reconstruction Based on Parallelization Transformations

The first section talks about what you need for the general case of a program that has been parallelized and you would like to serialize trace from a run of the parallel program based on some understanding of what transformations were done during parallelization (i.e., you know how different bits of the program relate to the original program). The second part talks about how you would specifically do this if the paralellization process included decoupling. The sketch describes the simplest case in which it can work but it is possible to relax these restrictions significantly.

Here is a brief description of what is required to do trace reconstruction for decoupled programs. That is, to be able to take a trace from the decoupled program and reorder it to obtain a legal trace of the original program.

Most relevant should be conditions 1-9 which say what we need from trace. Where the conditions do not hold, there need to be mechanisms to achieve the same effect or a way of relaxing the goals so that they can still be met. For example, if we can only trace activity on the bus and two kernels running on the same DE communicate by one leaving the result in DE-local memory and the other using it from there, then we either add hardware to observe accesses to local memories or we tweak the schedule to add a spurious DMA copy out of the local memory so that it appears on the bus or we pretend we didn't want to see that kind of activity anyway.

Condition 10 onwards relate mainly to what decoupling aims to achieve. But, some conditions are relevant such as conditions 5 and 6 because, in practice, it is useful to be able to relax these conditions slightly. For example (5) says that kernels have exclusive access to buffers but it is obviously ok to have multiple readers of the same buffer and it would also be ok (in most real programs) for two kernels to (atomically) invoke ‘malloc’ and ‘free’ in the middle of the kernels even though the particular heap areas returned will depend on the precise interleaving of those calls and it may even be ok for debugging printfs from each kernel to be ordered.

Initial assumptions (to be relaxed later):

- 1. Trace can see the start and end of each kernel execution and can identify which kernel is being started or is stopping.
- 2. Trace can see context switches on each processor and can identify which context we are leaving and which context we are entering.

Consequences of (1)-(2): We can derive which kernel instance is running on any processor at any time.

- 3. Trace has a coherent, consistent view of all activity on all processors.
- 4. Trace can identify the source of all transactions it observes.
  - Two mechanisms that can make this possible are:
  - 1. Trace might observe directly which processor caused a transaction. or
  - 2. Trace might observe some property of the transaction such as the destination address and combine that with some property of the kernels running at that time.

Condition 2 can be satisfied if we have each kernel only accesses buffers that are either:

- 1. At a static address (and of static length); or
- 2. At an address (and of a length) that are handed to the kernel at the start of kernel execution and trace can infer what that address and length are.

Consequences of (1)-(4): We can identify each transaction with a kernel instance and we can see all transactions a kernel performs.

- 5. Each kernel instance has exclusive access to each buffer during its execution. That is, all inter-kernel communication occurs at kernel boundaries.
- 6. Each kernel's transactions only depend on the state of the buffers it accesses and the state of those buffers only depends on the initial state of the system and on transactions that kernels have performed since then.
  
  Consequences of (1)-(6): Given a trace consisting of the interleaved transactions of a set of kernel instances, we can reorder the transactions such that all transactions of a kernel are contiguous and the resulting trace satisfies all read after write data dependencies. That is, we can construct a sequentially consistent view of the transactions as though kernels executed atomically and sequentially.

Note that there may be many legal traces. e.g., if A (only) writes to address 0 and then 1 and B (only) writes to address 2 and then 3 then the trace ‘0,2,1,3’ could be reordered to ‘0, 1, 2, 3’ or to ‘2,3,0,1’.

- 7. Sequencing of each kernel instance is triggered by a (single) state machine. There are a number of parallel state machines. (State machines may be in dedicated hardware or a number of state machines may be simulated on a processor.)
- 8. State machines can synchronize with each other and can wait for completion of a kernel and state transitions can depend (only) on those synchronizations and on the results of kernels.
- 9. Trace has a sequentially consistent, coherent view of all state transitions of the sequencers and all synchronization.
  
  Consequences of (7)-(9): Given a trace of the state transitions and synchronizations, we can reorder them into any of the set of legal transitions those state machines could have made where a transition is legal if it respects synchronization dependencies.

Consequences of (1)-(9): Given a trace of all kernel transactions and all state transitions and synchronizations, we can reorder them into any legal trace which respects the same synchronization dependencies and data dependencies.
The challenge of trace reconstruction is to show that, if you decouple a program, then the following holds. (Actually, this is what you want to show for almost any way you may parallelize a program.)

- 10. We assume that we have a single ‘master’ deterministic state machine that corresponds to the set of parallel, deterministic state machines in the following way:
  - a. Any trace of the ‘master’ state machine is a legal trace of the parallel state machine.
  - b. Some traces of the parallel state machine can be reordered into a legal trace of the master state machine.
  - c. Those traces of the parallel state machine that cannot be reordered to give a legal trace of the master, are a prefix of a trace that can be reordered to give a legal trace of the master.
- That is, any run of the parallel machine can be run forward to a point equivalent to a run of the master state machine.

(We further assume that we know how to do this reordering and how to identify equivalent points.)
Consequences of (1)-(10): We can reorder any trace to match a sequential version of the same program.

To show that decoupling gives us property (10) (i.e., that any trace of the decoupled program can be reordered to give a trace of the original program and to show how to do that reordering), we need to establish a relationship between the parallel state machine and the master state machine (i.e., the original program). This relationship is an “embedding” (i.e., a mapping between states in the parallel and the master machines such that the transitions map to each other in the obvious way). It is probably easiest to prove this by considering what happens when we decouple a single state machine (i.e., a program) into two parallel state machines.

When we decouple, we take a connected set of states in the original and create a new state machine containing copies of those states but:

- 1. The two machines synchronize with each other on all transitions into and out of that set of states.
- 2. The two machines contain a partition of the kernel activations of the original machine.
- 3. The two machines each contain a subset (which may overlap) of the transitions of the original machine.
  
  From this, it follows that the parallel machine_can_execute the same sequence as the original machine. To show that it_must_execute an equivalent sequence (i.e., that we can always reorder the trace), we need the following properties of decoupling:
- 4. All data dependencies are respected: if kernel B reads data written by kernel A, then both are executed in sequence on the same state machine or the state machines will synchronize after A completes and before B starts.
- Note that this depends on the fact that channels are FIFO queues so data is delivered in order.
  
  Extensions of decoupling allow the programmer to indicate that two operations can be executed in either order even though there is a data dependency between them (e.g., both increment a variable atomically). This mostly needs us to relax the definition of what trace reconstruction is meant to do. One major requirement is that the choice of order doesn't have any knock-on effects on control flow.

5. Deadlock should not happen:

- threads cannot block indefinitely on a put as long as each queue has space for at least one value.
- threads cannot block indefinitely on a get: either one thread is still making progress towards a put or, if they both hit a get, at least one will succeed.

Outline proof: Because they share the same control flow, the two threads perform opposing actions (i.e., a put/get pair) on channels in the same sequence as each other. A thread can only block on a get or a put if it has run ahead of the other thread. Therefore, when one thread is blocked, the other is always runnable.

Extensions of Decoupling Allow for the following.

1. Locks are added by the programmer.

To avoid deadlock, we require:

- The standard condition that locks must always be obtained in a consistent order.
- If the leading thread blocks on a channel while holding a lock, then the trailing thread cannot block on the same lock.

A sufficient (and almost necessary) condition is that a put and a get on the same channel must not be inside corresponding critical sections (in different threads):

// Not allowed

parallel_sections{

section{ ... lock(l); ... put(ch,x); ... unlock(l); ...}

section{ ... lock(l); ... get(ch,x); ... unlock(l); ...}

}

- which means that the original code cannot have looked like this:
- . . . lock(1); . . . DECOUPLE(ch,x); . . . unlock(1); . . .

That is, extreme care must be taken if DECOUPLE occurs inside a critical section especially when inserting DECOUPLE annotations automatically

2. Puts and gets don't have to occur in pairs in the program.

A useful and safe special case is that all initialization code does N puts, a loop then contains only put-get pairs and then finalization code does at most N gets. It should be possible to prove that this special case is ok.

It might also be possible to prove the following for programs containing arbitrary puts and gets: if the original single-threaded program does not deadlock (i.e., never does a get on an empty channel or a put on a full channel), then neither will the decoupled program.

Overview

A long-standing problem of parallelizing compilers is that it is virtually impossible to provide the programmer with a start-stop debugger that lets them debug in terms of their sequential program even though it runs in parallel. In particular, we would like to be able to run the program quickly (on the parallel hardware) for a few minutes and then switch to a sequential view when we want to debug.

It is not necessary (and hard) to seamlessly switch from running parallel code to running sequential code but it is feasible to change the scheduling rules to force the program to run only one task at a time. With compiler help, it is possible to execute in almost the sequence that the original program would have executed. With less compiler help or where the original program was parallel, it is possible to present a simpler schedule than on the original program. This method can be applied to interrupt driven program too.

This same method of tweaking the scheduler while leaving the application unchanged can be used to test programs more thoroughly. Some useful examples:

- Testing the robustness of a real time system by modifying the runtime of tasks. Making a task longer may cause a deadline to be missed. Making a task longer may detect scheduling anomalies where the system runs faster if one part becomes slower. (Scheduling anomalies usually indicate that priorities have been set incorrectly.) Making tasks take randomly longer times establishes how stable a schedule is.
- Providing better test coverage in parallel systems. Race conditions and deadlock often have a small window of opportunity which it is hard to detect in testing because the ‘windows’ of several threads have to be aligned for the problem to manifest. By delaying threads by different amounts, we can cause different parts of each thread to overlap so that we can test a variety of alignments. (We can also measure which alignments we have tested so far for test-coverage statistics and for guided search.) This is especially useful for interrupt-driven code.
  
  John Regehr did some work on avoiding interrupt overload by delaying and combining interrupts. http://portal.acm.org/citation.cfm?id=945445.945454&d1=portal&d1=ACM&type=series&idx=945445&part=Proceedings&WantType=Proceedings&title=ACM %20Symposium %20 on %20Operating %20Systems %20Principles&CFID=11111111&CFTOKEN=2222222
  
  but this is really about modifying the (hardware) scheduling of interrupts to have more desirable properties for building real time systems whereas we are more interested in:
- debugging, tracing. and testing systems (and some of the stuff we do might actually break real-time properties of the system)
- thread schedulers (but we still want to do some interrupt tweaking)

Testing Concurrent Systems

Errors in concurrent systems often stem from timing-dependent behaviour. It is hard to find and to reproduce errors because they depend on two independently executing sections of code executing at the same time (on a single-processor system, this means that one section is preempted and the other section runs). The problematic sections are often not identified in the code.

Concurrent systems often have a lot of flexibility about when a particular piece of code should run: a task may have a deadline or it may require that it receive 2 seconds of CPU in every 10 second interval but tasks rarely require that they receive a particular pattern of scheduling.

The idea is to use the flexibility that the system provides to explore different sequences from those that a traditional scheduler would provide. In particular, we can use the same scheduler but modify task properties (such as deadlines or priorities) so that the system should still satisfy real time requirements or, more flexibly, use a different scheduler which uses a different schedule.

Most schedulers in common use are ‘work conserving schedulers’: if the resources needed to run a task are available and the task is due to execute, the task is started. In contrast, a non-work-conserving scheduler might choose to leave a resource idle for a short time even though it could be used. Non-work-conserving schedulers are normally used to improve efficiency where there is a possibility that a better choice of task will become available if the scheduler delays for a short time.

A non-work-conserving scheduler for testing concurrent systems because they provide more flexibility over the precise timing of different tasks than does a work-conserving scheduler. In particular, we can exploit flexibility in the following way:

- model the effect of possibly increased runtime of different tasks. e.g., if task A takes 100 microseconds and we want to know what would happen if it took 150 microseconds, the scheduler can delay scheduling any tasks for 50 microseconds after A completes. A special case is uniformly slowing down all tasks to establish the ‘critical scaling factor’. Another interesting thing to watch for is ‘scheduling anomalies’ where a small change in the runtime of a task can have a large effect on the overall schedule and, in particular, where increasing the runtime of one task can cause another task to execute earlier (which can have both good and bad effects).
- model the effect of variability in the runtime of different tasks by waiting a random amount of time after each task completes
- cause two tasks to execute at a range of different phases relative to each other by delaying the start of execution of one or the other of the tasks by different amounts. Where the tasks are not periodic, (e.g., they are triggered by external events) you might delay execution of one task until some time after the other task has been triggered.
  
  In all these cases, the modification of the schedule is probably done within the constraints of the real-time requirements of the tasks. For example, when a task becomes runnable, one might establish how much ‘slack’ there is in the schedule and then choose to delay the task for at most that amount. In particular, when exploring different phases, if the second event doesn't happen within that period of slack, then the first event must be sent to the system and we will hope to explore that phase the next time the event triggers.

It is often useful to monitor which different schedules have been explored either to report to the programmer exactly what tests have been performed and which ones found problems or to drive a feedback loop where a test harness keeps testing different schedules until sufficient test coverage has been achieved.

Debugging Parallel Systems

When a sequential program is parallelized, it is often the case that one of the possible schedules that the scheduler might choose causes the program to execute in exactly the same order that the original program would have executed. (Where this is not true, such as with a non-preemptive scheduler it is sometimes possible to insert pre-emption points into the code to make it true.)

If the scheduler is able to determine what is currently executing and what would have run next in the original program, the scheduler can choose to execute the thread that would run that piece of code. (Again, it may be necessary to insert instrumentation into the code to help the scheduler figure out the status of each thread so that it can execute them in the correct order.)

Tracing Parallel Systems
Reduce amount of reordering required by reordering the execution which might reduce size of window required, simplify task of separating out parallel streams of memory accesses, eliminate the need to reorder trace at all, etc.
Overview

Working with the whole program at once and following compilation through many different levels of abstraction allows us to exploit information from one level of compilation in a higher or lower level. Some examples:

- Executing with very abstract models of kernels can give us faster simulation which gives visualization, compiler feedback and regression checks on meeting deadlines.
- We can plug a high level simulator of one component into a low level system simulation (using back-annotation of timing) and vice-versa.
- We can simulate at various levels of detail: trace start/stop events (but don't simulate kernels), functional simulation using semihosting, bus traffic simulation, etc.
- We can use our knowledge of the high level semantics to insert checking to confirm that the high-level semantics is enforced. For example, if a kernel is supposed to access only some address ranges, we can use an MPU to enforce that.
- We can reconstruct a ‘message-passing view’ of bus traffic.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

Analyzing diagnostic data generated by multiple threads within an instruction stream

Information

Publication Number

Date Filed

Date Published

Inventors

CPC

US Classifications

International Classifications

Abstract

Description

Claims

Provisional Applications (1)