The present invention relates to a diagnostic apparatus and a corresponding method for generating diagnostic data relating to processing of an instruction stream.
Computer programs are typically subject to intensive testing and debugging in order to ensure they will function reliably when executed. Where a computer program has been compiled from source code, such testing and debugging should also be carried out on the compiled program. One particular type of compiler can transform a program with only one sequence of instructions into a program with multiple sequences of instructions (referred to hereinafter as multiple threads) which can, to a certain degree, be executed in parallel if run on a multi-processor system. Such a compiler may be referred to as a parallelising compiler. While a multi-threaded program generated in this way can make efficient use of system resources when executed on a multi-processor system, it becomes difficult to debug the compiled program because the debugger view of the compiled program may be completely different from the debugger view which would be provided in respect of the source program. In particular, it may not be possible to set breakpoints at the same positions in the program (for example inside loops that have been parallelised), and different runs of the program on the same data may provide different debug views depending on how the debugger is invoked.
Additionally, testing a multi-threaded program can be problematic because the behaviour of the program can, often incorrectly, depend on the precise timing behaviour of the different threads, and a small perturbation of the system, due for instance to inputs of other users or bus contention, can affect that timing.
The above problems are particularly apparent in the case of system-on-chip (SoC) devices, which are widely available in the form of consumer electronic devices such as mobile phones. SoC devices may rely heavily on parallel processing in order to provide high performance and low power consumption. Additionally, because SoC devices are embedded systems, debugging software applications on them is more difficult and requires the use of external hardware and software. It is thus highly desirable in this context to provide an improved and more programmer-friendly mechanism for debugging parallel programs.
According to one aspect of the present invention, there is provided a diagnostic method for generating diagnostic data relating to processing of an instruction stream, wherein said instruction stream has been compiled from a source instruction stream to include multiple threads, said method comprising the steps of:
(i) initiating a diagnostic procedure in which at least a portion of said instruction stream is executed;
(ii) controlling a scheduling order for executing instructions within said at least a portion of said instruction stream to cause execution of a sequence of thread portions, said sequence being determined in response to one or more rules, at least one of said rules defining an order of execution of said thread portions to follow an order of said source instruction stream.
The present invention addresses the above problems by allowing the diagnostic procedure to generate a debug view of a parallelised program which is the same as, or at least similar to, a debug view which would be provided when debugging the original non-parallelised program. This makes it easier for the programmer to debug the parallelised program, because the order of execution of instructions in the parallelised program will be at least similar to the order of execution of the respective instructions in the original non-parallelised program, which the programmer will have written himself, and thus will understand. Additionally, this diagnostic procedure will provide a more consistent debug view of the parallelised program, because the timing behaviour of the different threads of the program can be controlled by the one or more rules. Clearly, it is desirable for the order of execution of the parallel program to be as close as possible to the order of execution of the original program, and thus preferably at least one of said rules defines an order of execution of said thread portions which substantially matches an order of said source instruction stream. It should be appreciated that the rule defining an order of the source instruction stream may specify that order and try to apply it to the compiled instruction stream but may in some circumstances be overridden by other rules. For instance a rule ensuring that the parallel program meets deadlines for performing an intended function may override the rule defining the order of the source instruction stream.
The above advantages are not exhibited by existing debuggers for parallel programs, which often restrict the debug view at a given time to only those parts of the parallel program which correspond to the original source program. For example, if the program initialises a data structure, then splits into four threads to modify the data structure, then waits for the four threads to complete before continuing execution, then the debugger may disallow observation of operations on the data structure during the time that multiple threads are modifying it, because the state of the data structure may not reflect any valid state of the original unthreaded program. Other existing debuggers may allow the programmer to observe any operation at any point in the parallel program, but will require the programmer both to understand how the program was parallelised, and to directly debug the multithreaded program, which is considerably harder to do. The present invention seeks to reduce the programmer's exposure to the parallelism of the multithreaded program.
Embodiments of the present invention may be applied to system-on-chip (SoC) devices.
In some embodiments said at least one of said rules defines an order of execution of said thread portions which substantially matches an order of said source instruction stream. This is clearly the easiest arrangement to debug; however, it may not always be possible to provide such an order of execution.
It will be appreciated that while the source program could consist of a single thread, which is then compiled (parallelised) to include multiple threads, the source program could itself be a parallel program, which is then compiled to increase parallelism by adding further threads. In this latter case, the diagnostic procedure may generate a debug view which exposes the programmer to some parallelism, in particular the parallelism of the original program, but this will still be easier for the programmer to understand and debug than the fully multithreaded object program.
In some embodiments one of the rules may comprise:
detecting when execution of a currently executing thread reaches a switching point in said instruction stream, and blocking said currently executing thread from further execution; and
determining a currently inactive thread which is runnable, and executing said instruction stream associated with said currently inactive thread.
This rule may serve to perform one or both of inhibiting parallelism, and reducing thread interleaving, either or both of which will tend to result in an instruction execution order similar to that of the original source code, in which parallelism is either not present or reduced, and potential threads of instructions are often set out in a non-interleaved manner. The effectiveness of this rule in modifying the instruction execution order to reduce parallelism and to match the original source code order may depend on the switching points used. For instance, one or more of the switching points may be communication points between threads which occur when a currently executing thread makes a value available to another thread. This may particularly be the case where variables are not shared between different threads, but a value to be shared between threads is instead passed from one thread to another over a communication channel. When a value is passed between threads in this way, it will often be the case that the flow of execution should switch from one thread to another in the debug mode in order to mimic the order of execution of the original source program.
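The rule set out above can be illustrated by a minimal sketch in Python. It is not the claimed implementation: threads are modelled as generators, each `yield` stands in for a switching point, and the names `thread1`, `thread2`, `run_serialised`, `channel` and `log` are hypothetical. Only one thread runs at a time; when it reaches a switching point it is blocked and the next inactive thread is resumed.

```python
from collections import deque

N = 3                 # hypothetical number of loop cycles
channel = deque()     # stands in for a communication channel between threads
log = []              # records the order in which work is actually performed

def thread1():
    for i in range(N):
        log.append(("b", i))    # stands in for producing a value
        channel.append(i)       # value made available to the other thread
        yield "put"             # communication point used as a switching point

def thread2():
    for _ in range(N):
        x = channel.popleft()   # obtain the value passed by thread 1
        log.append(("c", x))    # stands in for consuming the value
        yield "get"             # switching point before the next value is needed

def run_serialised(threads):
    """Run one thread at a time, switching only at switching points."""
    inactive = deque(threads)
    while inactive:
        current = inactive.popleft()      # pick a currently inactive thread
        try:
            next(current)                 # execute up to its next switching point
            inactive.append(current)      # block it and let another thread run
        except StopIteration:
            pass                          # thread finished; do not reschedule

run_serialised([thread1(), thread2()])
```

In this sketch the resulting `log` alternates between the producing and consuming steps for each cycle, which is the order a non-parallelised loop would follow.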
One or more of the switching points may be a synchronisation point at which one or more threads switches from a runnable state to a non-runnable state, or from a non-runnable state to a runnable state.
Communication points and synchronisation points are particularly suitable for use as switching points, because they can be readily discerned from the parallel code.
Communication points and synchronisation points are types of switching point which are inherently present in the compiled program code. It may however be necessary to add switching points to the program code to facilitate the modified scheduling order required to execute the parallel code in the same order as the original code. In this case, one or more thread yield instructions may be added by a compiler as switching points when the source instruction stream is compiled. Such a thread yield instruction may for instance be added to a thread when a compilation of an instruction from the source instruction stream does not generate a corresponding instruction in that thread.
The above switching points are provided within the object program code itself. However, it is also possible to add one or more breakpoints during execution of said instruction stream as switching points. This can be done either as an alternative to the use of communication points, synchronisation points and/or thread yield instructions, or as additional switching points. A position of the breakpoints may be determined from data generated by a compiler during a compilation of the source instruction stream.
One or more of the rules used to define the scheduling order may be generated from sequence data which was in turn generated during compilation of the instruction stream from the source instruction stream, with the sequence data being indicative of an order of the source instruction stream. The sequence data may be a discrete file, or may form part of a debug map which provides a correspondence between instructions of the source code and instructions of the object code.
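As a purely illustrative sketch of what such sequence data might look like, the following Python fragment records, for each object instruction, the thread it was placed in and the position of the source instruction it was compiled from. The record layout, the function `generate_sequence_data` and the instruction labels are assumptions made for illustration and are not taken from any particular compiler.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SequenceRecord:
    source_index: int    # position of the statement in the source instruction stream
    source_label: str    # label of the source instruction
    thread: int          # thread into which the object instruction was placed
    object_label: str    # label of the object instruction

def generate_sequence_data(source_labels, partition):
    """Build sequence data from a hypothetical compiler partition.

    partition maps an object instruction label to a pair
    (thread, source label it was compiled from).
    """
    position = {label: idx for idx, label in enumerate(source_labels)}
    records = [
        SequenceRecord(position[src], src, thread, obj)
        for obj, (thread, src) in partition.items()
    ]
    # Sort by source position so a debugger can read off the source order.
    return sorted(records, key=lambda r: (r.source_index, r.thread))

# Hypothetical source instructions and their split across two threads.
source_labels = ["a", "b", "c", "d"]
partition = {
    "a1": (1, "a"), "b1": (1, "b"), "d1": (1, "d"),
    "a2": (2, "a"), "c2": (2, "c"), "d2": (2, "d"),
}
sequence_data = generate_sequence_data(source_labels, partition)
```

Object instructions with no source counterpart, such as instructions inserted for inter-thread communication, are simply omitted from this sketch; a real debug map would need some convention for them.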
According to another aspect of the invention, there is provided a diagnostic apparatus for generating diagnostic data relating to processing of an instruction stream, wherein said instruction stream has been compiled from a source instruction stream to include multiple threads, said diagnostic apparatus comprising:
a diagnostic engine for initiating a diagnostic procedure in which at least a portion of said instruction stream is executed; and
a scheduling controller for controlling a scheduling order for executing instructions within said at least a portion of said instruction stream to cause execution of a sequence of thread portions determined in response to one or more rules, at least one of said rules defining an order of execution of said thread portions to follow an order of said source instruction stream.
According to another aspect of the invention, there is provided a method of compiling an instruction stream from a source instruction stream to include multiple threads, comprising the step of:
generating sequence data during compilation of said source instruction stream, said sequence data being indicative of an order of said source instruction stream.
According to another aspect of the invention, there is provided a parallelising compiler for compiling an instruction stream from a source instruction stream to include multiple threads, the compiler comprising:
a sequence data generator operable to generate sequence data during compilation of said source instruction stream, said sequence data being indicative of an order of said source instruction stream.
Various other aspects and features of the present invention are defined in the claims, and include a computer program product.
The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.
Referring to
Program code for execution by a data processing system basically comprises a list of instructions which are traditionally executed sequentially by a processor. While this list is often broken down into multiple functions and sub-routines, it would traditionally still be executed sequentially, with the processor executing each instruction in turn before moving on to the next instruction in the sequence. However, in the case of a multithreaded program, the list of instructions is constructed in such a way that certain instructions or groups of instructions can be executed at the same time on different processors. It will be appreciated that there will be limits to which instructions can be executed in parallel. For instance, there will be interrelationships in the program code which will require certain instructions to be executed before others. For example, in order for a variable var to be read, a value should previously have been assigned to the variable var, and so an instruction to read the variable var should not be executed until after the instruction to write a value to the variable var. Accordingly, it will be understood that certain elements of program code should be executed sequentially in order for them to function correctly. However, other elements of program code can be executed independently of each other, and thus can be executed in parallel on a multi-processor data processing system.
Two main types of program parallelism are possible. The first of these, task parallelism, occurs where two different tasks are executed in parallel, either on the same or different data. For example, in the context of
Instruction (a) sets up a loop in which a variable i is initialised to zero on first execution and then incremented by 1 for each cycle of the loop. The loop is specified to continue until the value of variable i reaches a value N. Within the loop, instruction (b) determines a value for a variable x in accordance with a function P( ), and instruction (c) executes a function Q( ) on the value stored in variable x. Instruction (d) closes the loop. It will be understood that instructions (b) and (c) can be described as data processing instructions which perform an operation on data values, whereas instructions (a) and (d) constitute control instructions which control if and when the data processing instructions can be executed. Although data processing instruction (c) depends on a result of data processing instruction (b), it is possible to execute instructions (b) and (c) in parallel by executing instruction (c) on a value of x determined in the previous cycle of the loop while the current cycle of the loop determines a new value for x. This can be achieved by splitting instructions (a) to (d) into two threads as shown in Table 1:
It can be seen from Table 1 that thread 1 comprises control instructions (a1) and (d1) which correspond to the control instructions (a) and (d) of the original code and that thread 2 comprises control instructions (a2) and (d2) which also correspond to the control instructions (a) and (d) of the original code. Thread 1 includes a data processing instruction (b1) which corresponds to the data processing instruction (b) of the original code, and also an instruction (e) which places the value of variable x generated by instruction (b1) into a communication channel using a put command. Thread 1 does not include an instruction corresponding to data processing instruction (c) of the original code, because this is provided separately in thread 2. Thread 2 includes an instruction (f) which obtains a value x from the communication channel using a get command, and also includes a data processing instruction (c2) which corresponds to the data processing instruction (c) of the original code. In particular, data processing instruction (c2) operates on the value of x obtained from the communication channel by instruction (f). Thread 2 does not include an instruction corresponding to data processing instruction (b) of the original code, because this is provided separately in thread 1. When executed, thread 1 generates a value for x at each cycle of the loop and places this value in a communication channel, where it can be obtained by thread 2 in the following cycle of the loop. While thread 2 is processing the value of x obtained from the communication channel, thread 1 will be generating a new value of x and placing it on the communication channel. In this way, data processing instructions (b) and (c) of the original code can be executed in parallel in a multithreaded version of the original code.
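The two-thread arrangement described above can be sketched in Python as follows. This is an illustration only: the bodies of `P` and `Q`, the loop bound `N`, the use of the loop index as an argument to `P`, and the use of `queue.Queue` as the communication channel are all assumptions, and the comments merely indicate which described instruction each line is intended to mirror.

```python
import queue
import threading

N = 5                              # hypothetical loop bound
channel = queue.Queue(maxsize=1)   # stands in for the communication channel
results = []                       # collects what Q( ) observes, for inspection

def P(i):
    # Placeholder for function P( ): determines a value for x.
    return i * i

def Q(x):
    # Placeholder for function Q( ): operates on the value of x.
    results.append(x + 1)

def thread1():
    for i in range(N):             # (a1) loop control
        x = P(i)                   # (b1) determine a value for x
        channel.put(x)             # (e)  put x into the channel
                                   # (d1) end of loop

def thread2():
    for _ in range(N):             # (a2) loop control
        x = channel.get()          # (f)  get x from the channel
        Q(x)                       # (c2) operate on x
                                   # (d2) end of loop

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start()
t2.start()
t1.join()
t2.join()
```

Because the channel is first-in first-out with a single producer and a single consumer, thread 2 sees the values of x in the order in which thread 1 generated them, while thread 1 is free to compute the next value concurrently.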
The other type of program parallelism, data parallelism, occurs where the same task is executed in parallel on different data. For example, in the context of
Consider the following sequence of instructions:
Instruction (j) sets up a loop in which a variable i is initialised to zero on first execution and then incremented by 1 for each cycle of the loop. The loop is specified to continue until the value of variable i reaches a value of 100. Within the loop, instruction (k) performs a function R on a value Input[i] of an array Input of values. Each cycle of the loop results in function R being performed on a different value within the array due to the fact that the index i to the array is incremented for each cycle. Instruction (l) closes the loop. It will be understood that instruction (k) can be described as a data processing instruction, whereas instructions (j) and (l) constitute control instructions. Parallelism can be introduced in this case by performing the function R on multiple different values concurrently. This can be achieved by splitting instructions (j) to (l) between two threads as shown in Table 2:
It can be seen from Table 2 that thread 1 comprises control instructions (j1) and (l1) which mainly correspond to the control instructions (j) and (l) of the original code and that thread 2 comprises control instructions (j2) and (l2) which also mainly correspond to the control instructions (j) and (l) of the original code. Thread 1 includes a data processing instruction (k1) which corresponds to the data processing instruction (k) of the original code, and thread 2 includes an instruction (k2) which also corresponds to the data processing instruction (k) of the original code. However, the slight difference between instruction (j1) and (j), and (j2) and (j) provides the parallelism in this case. In particular, it can be seen that instruction (j1) sets up a loop in which the variable i ranges from 0 to 49 compared with the range of 0 to 99 set up by instruction (j) of the original code, and that instruction (j2) sets up a loop in which the variable i ranges from 50 to 99 compared with the range of 0 to 99 set up by instruction (j) of the original code. In this way, the first thread carries out function R in respect of one half of the array Input[ ] and the second thread carries out function R in respect of the other half of the array Input[ ]. Thus, the same data processing task, function R, can be executed in parallel using two threads on two separate processors using different data.
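A minimal Python sketch of this data-parallel split is given below. The body of `R`, the contents of `Input`, and the `Output` array used to make the effect of `R` observable are assumptions introduced for illustration; only the division of the index range into 0 to 49 and 50 to 99 follows the description above.

```python
import threading

Input = list(range(100))    # hypothetical input array of 100 values
Output = [None] * 100       # assumption: results are stored so they can be checked

def R(value):
    # Placeholder for function R.
    return value * 2

def worker(start, stop):
    for i in range(start, stop):     # (j1) or (j2): loop over half of the index range
        Output[i] = R(Input[i])      # (k1) or (k2): apply R to Input[i]
                                     # (l1) or (l2): end of loop

t1 = threading.Thread(target=worker, args=(0, 50))     # i ranges from 0 to 49
t2 = threading.Thread(target=worker, args=(50, 100))   # i ranges from 50 to 99
t1.start()
t2.start()
t1.join()
t2.join()
```

The two threads touch disjoint halves of the array, so no communication channel is needed between them in this sketch.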
As described above, program code can be adapted to add parallelism, thereby enabling an increase in performance when executed on a multi-processor system. The addition of parallelism can be achieved by using a parallelising compiler as schematically illustrated in
While the parallelism introduced by the parallelising compiler 200 makes the execution of the object code more efficient when run on a multi-processor system, the process of debugging the object code is, as described above, usually much more challenging, because the order in which instructions are executed may differ greatly from the order in which the corresponding instructions would be executed in the original source code. Accordingly, it is desirable when debugging the object code to execute or step through the object code in an order which mimics the original execution order of the source code. Referring to
The rescheduling shown in
In addition to communication points, other suitable places in the code can be used as switching points. For example, synchronisation points at which one or more threads switches from a runnable state to a non-runnable state, or from a non-runnable state to a runnable state, also constitute suitable switching points. Examples of synchronisation points include points in a thread which may require another parallel thread to catch up before the thread can continue execution.
Additionally, and particularly where there is an insufficient number of communication points or synchronisation points, switching points can be added into the code, either at compile-time by the compiler inserting thread yield instructions, or at run-time in the form of breakpoints. In the case of adding breakpoints, it is possible to force a context switch to happen at a particular point in the program by inserting a breakpoint and suspending a current thread when that breakpoint is reached.
A debugging apparatus which utilises the above method is schematically illustrated with reference to
The ICE is a hardware device which enables the development system 410 to access the data processing system 100 via the Debug Access Port 430, and which enables programs to be loaded into the data processing system 100. The program so-loaded can be executed and/or stepped through under the control of the programmer. The development system 410 may be a dedicated test device or a general purpose computer, in either case being provided with a debugger application 415 which provides an interactive user interface for the programmer to investigate and control the data processing system 100.
In normal operation, the data processing system 100 will execute program code in accordance with a scheduling order defined by a scheduling function of the control processor 110. However, when operating in a debug mode under the control of the development system 410, program code is executed using an alternative scheduling order defined by the debugger application. This alternative scheduling order results from one or more rules intended to cause the program code to be executed in an order which follows an order of a source instruction stream from which the program code was compiled. In the present case, the rules are defined at least in part based on sequence data generated when the source instruction stream was compiled into the program code, and made available to the debugger application. The sequence data would represent an instruction order of the source instruction stream. Alternatively, in the absence of such sequence data, the rules may be based on an assumed instruction order of the source instruction stream. It will be appreciated that it may not always be possible to execute the program code in an order which identically matches the order of the source instruction stream, because to do so may in some circumstances result in the program failing to meet a deadline and thus causing an error. In other words, the present technique takes advantage of the flexibility which usually exists in the scheduling of program code execution, but as a result requires there to be some slack in the schedule because if it is not possible to delay execution of a task because a deadline would be missed, the present technique may not safely be applied to that task.
The present technique may cause the program to execute more slowly than the original sequential program. However, to overcome this, the program can be run at full speed (without rescheduling) until a particular event occurs, and then switched to a slower debug mode (with rescheduling) while the system is debugged. It is generally acceptable to run more slowly in a debug mode because the slowest part of the system is the programmer typing debug commands.
Referring to
The remaining steps relate to the debugging of the object code. At a step S4, the object code is executed in a debug mode. During execution, it is determined at a step S5 whether a switching point has been reached. As described above, the switching point could be a communication point, a synchronisation point or a thread yield instruction. If a switching point has not been reached, the currently executing code may optionally be displayed to the programmer as a debug view at a step S6. If however a switching point has been reached, the debug scheduler is invoked at a step S7. The scheduler determines, at a step S8, the next thread to be executed. This determination is conducted based on one or more rules, at least one of which is intended to force the instruction execution order of the object code to follow the order of the source code. At a step S9, the thread selected at the step S8 is executed, and all other threads are blocked. From the step S9, the process moves to the step S6, where the currently executing code may be displayed. In this way, the object code is executed sequentially, preferably in an order of the source code. It will be appreciated that, in some embodiments, the programmer may not be provided with a real time visual display, or may only be provided with a visual display periodically during execution of the code.
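The debugging steps S4 to S9 can be sketched as a small Python loop. This is a simplified model under stated assumptions, not the described apparatus: each thread is a list of tuples (object label, source position, is switching point), the source positions play the role of sequence data, and the single rule applied at step S8 is to pick the runnable thread whose next instruction comes earliest in the source order. The function `debug_run` and the example data are hypothetical.

```python
def debug_run(threads):
    """Execute object code one thread at a time, following source order.

    threads maps a thread id to a list of tuples
    (object_label, source_position, is_switching_point).
    Returns the debug view as a list of (thread id, object_label).
    """
    pc = {tid: 0 for tid in threads}    # next instruction index per thread
    debug_view = []

    def runnable():
        return [tid for tid in threads if pc[tid] < len(threads[tid])]

    def select_next_thread():
        # S8: choose the thread whose next instruction is earliest in the source.
        return min(runnable(), key=lambda tid: threads[tid][pc[tid]][1])

    current = None
    while runnable():
        if current not in runnable():
            current = select_next_thread()
        label, _, is_switching_point = threads[current][pc[current]]
        pc[current] += 1                        # S4: execute in debug mode
        debug_view.append((current, label))     # S6: display current code
        if is_switching_point and runnable():   # S5: switching point reached?
            current = select_next_thread()      # S7 to S9: reschedule, block others
    return debug_view

# Two unrolled cycles of a hypothetical producer/consumer pair of threads.
threads = {
    1: [("b1", 0, False), ("e", 0, True), ("b1", 2, False), ("e", 2, True)],
    2: [("f", 1, False), ("c2", 1, True), ("f", 3, False), ("c2", 3, True)],
}
view = debug_run(threads)
```

In this model the debug view alternates between the two threads at each switching point, so the instructions appear in the order of the source positions attached to them.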
Although particular embodiments have been described herein, it will be appreciated that the invention is not limited thereto and that many modifications and additions thereto may be made within the scope of the invention. For example, various combinations of the features of the following dependent claims can be made with the features of the independent claims without departing from the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
0717706.6 | Sep 2007 | GB | national |
Number | Date | Country | |
---|---|---|---|
60853756 | Oct 2006 | US |