The present invention is generally directed to implementing exception handling in computer program code, and in particular, to implementing exception handling in computer program code run on a graphics processing unit.
The problem that is addressed herein is to support the exception handling feature of programming languages for graphics processing unit (GPU) computing applications on GPU architectures. Many programming languages, such as C++, Java, C#, Python, Ada, Ruby, and more, support exception handling (EH), which provides a way to react to exceptional circumstances (like runtime errors) in a program by transferring control and information from the exception point to an exception handler. The purpose of EH is to cleanly separate the error handling from the rest of the program logic. The C++ EH feature is used herein to discuss the issues and to illustrate the proposed techniques, but the discussions and techniques are also applicable to the EH support in other languages.
The current C++ EH mechanism is defined with respect to single-thread execution. There are certain extensions beyond the current C++ language standard to propagate exceptions across multiple threads. A number of such extensions, and how they may be handled on the Fusion System Architecture (FSA), have been previously described. FSA is an architecture for accelerated processing units (APUs), which combine features of a central processing unit (CPU) and a GPU on a single die and are manufactured by Advanced Micro Devices, Inc.
C++ EH primarily consists of the try, catch, throw, and re-throw constructs. A try block encloses a portion of code under exception inspection. An exception is thrown by using a throw clause from inside a try block. Exception handlers are declared with a catch clause, which is placed immediately after the corresponding try block. If no exception is thrown, the program execution continues normally and all handlers are ignored. Matching a thrown exception object to an exception handler is based on the type specified in the catch clause. If an exception is thrown but is not caught by any immediate catch clause, the exception is propagated to the enclosing try blocks to check against their respective catch clauses. If an exception handler is not located within the current function, the current function returns and the call stack is unwound to the caller to search for a proper exception handler. This process continues until an exception handler is found or the execution is terminated when the search exhausts all call stack frames.
The following is a simple C++ EH example.
Because the function foo( ) throws an exception object with a value of 20, which is an integer, and the exception handler in the function foo( ) has a float type, the exception is not caught within the function foo( ); the function foo( ) immediately returns and propagates the exception to the caller. The exception is caught by the handler “catch (int e)” in the function main( ), and the output of the program is 10. If the exception handler in the function foo( ) were “catch (int e)” as noted in the comment, the thrown exception would have been caught and handled by the handler in the function foo( ). The function foo( ) would have returned normally without an exception, and the output would have been 40.
While the C++ language including the EH feature has been fully supported on CPUs for decades, the differences between CPU and GPU architectures as well as the differences in their associated tool chains present new challenges to support some existing C++ language features on a GPU.
Because GPUs use the SIMD (single instruction, multiple data) execution model (e.g., a vector instruction) to support data parallelism (each thread executing with one piece of data), a set of work-items share the same instruction pointer and are executed in lock-step. But there are times when these work-items need to execute different code paths due to differences in the data processed by the respective work-items.
Predication is one mechanism to handle such thread divergence, where the predicated-off work-items still execute the same instruction stream along with the predicated-on work-items, except that they do not write any results that affect the architectural state. But predication usually handles only the limited set of control flow divergence found in regular control flow structures. With divergence in more complex control flows, the GPU architecture may serialize the execution of diverged work-items through a pair of specially marked branch and join instructions. Because compilers typically generate code one function at a time, it is infeasible to place the branch and join instructions in different functions in such cases. This practically limits the support of thread divergence to within a function scope. If the execution of work-items were to diverge across a function boundary, this restriction would require the functions to join at the function granularity and then diverge again immediately after the function returns.
The GPU tool chains are noticeably different from the CPU tool chains. Because GPU architectures evolve quickly and have proprietary instruction set architectures (ISAs), GPU vendors typically expose an abstract intermediate representation (as opposed to an actual ISA) to software, where this abstract intermediate representation is stable over multiple generations of GPU architectures.
In the case of FSA, this layer is called FSAIL (FSA Intermediate Language). Because of the gap between FSAIL and the native GPU ISA, a layer of software is required to dynamically translate FSAIL instructions to the native GPU ISA; this software is called the FSA just-in-time (JIT) compiler. The compiler which generates the FSAIL instructions has a similar role to the typical CPU compiler and, in contrast to the JIT compiler, is referred to as a high-level compiler. Because the JIT compiler translates FSAIL instructions and possibly re-orders the produced native GPU instructions, the FSAIL instruction order produced by the high-level compiler may not be preserved in the JIT-produced native instruction sequence. This potentially inconsistent order presents a challenge if a high-level compiler intends to communicate certain information to the runtime system by relying on the FSAIL instruction order, which is part of the design in a “zero-cost” EH implementation on CPUs (discussed below).
EH is considered a high-productivity language feature instead of a performance feature. Exceptions are expected to occur infrequently, and hence the performance of handling exceptions is usually less of a concern. One of the key design issues in implementing EH is to minimize any adverse performance impact when EH constructs are present but no exceptions are actually thrown, which is expected to be the common case. Unless a compiler is told otherwise, C++ programs by default have to assume that any function call may throw an exception.
The EH features in C++ and other programming languages have been well-supported on CPUs. Some initial EH implementations simply mapped EH constructs back to setjmp/longjmp instructions, which had been an error handling mechanism predating the general EH language feature. There are a number of issues with the setjmp/longjmp approach. First, it imposes a fair amount of overhead to save and set up information on regular execution paths to be ready to handle exceptions even if no exceptions are eventually thrown. Second, the presence of setjmp/longjmp instructions often requires shutting down many subsequent compiler optimizations to preserve the states that need to be saved and restored. In addition, setjmp/longjmp instructions may transfer control flows across a function boundary. While this is not an issue on a CPU, it does not work well with the current GPU data parallel architectures as mentioned above.
In one implementation, the Itanium application binary interface (ABI) Exception Handling Specification defines a methodology for providing outlying data in the form of exception tables, without inlining the testing of exception occurrence to conditionally branch to exception handling code in the flow of an application's main algorithm. Thus, the specification is said to add “zero-cost” to the normal execution of an application.
In the “zero-cost” EH implementation, a C++ compiler generates exception tables stored in data sections of object files and retrieved by the C++ EH runtime library when an exception is thrown during program execution. The runtime system first attempts to find an exception frame corresponding to the function where the exception was thrown. The exception frame contains a reference to an exception table describing how to process the exception. If the exception needs to be forwarded to a prior activation (i.e., a caller), the exception frame contains information about how to unwind the current activation and restore the state of the prior activation. An exception handling personality is defined by way of a personality function (e.g., __gxx_personality_v0 in C++), which receives the context of the exception, an exception structure containing the exception object type and value, and a reference to the exception table for the current function.
An exception table is organized as a series of code ranges defining what to do if an exception occurs in that range. Typically, the information associated with a range defines which types of exception objects (using C++ type information) are handled in that range, and an associated action that should take place. Actions typically pass control to a landing pad. A landing pad corresponds to the code found in the catch portion of a try/catch sequence. When execution resumes at a landing pad, it receives the exception structure and a selector corresponding to the type of exception thrown. The selector is then used to determine which catch clause should process the exception.
While the steps to identify exception handlers and handle exceptions are elaborate in the “zero-cost” EH implementation, the biggest advantage of this approach is that these costs are incurred only when exceptions occur. Normal execution paths, where no exceptions are thrown, incur minimal performance impact. Another benefit is that this approach puts a fair amount of work, e.g., stack unwinding and exception frames, into the common ABI of a given architecture. This common support can be shared across the EH features of different programming languages, often with slight variations, and can reduce the amount of language-specific work. This also allows EH to work when mixing functions written in different languages in an application.
Applying the “zero-cost” EH implementation to a GPU encounters certain issues. First, the unwinding step could lead to threads diverging across a function boundary. This is related to a hardware limitation of GPUs, in that thread divergence on a GPU only works within a function boundary. But, for EH to work properly, it needs to be able to work across function boundaries (to be able to locate an exception handler for a thrown exception).
Second, the FSAIL instructions that are generated by high-level compilers are abstract instructions and may be subsequently re-ordered by the JIT compiler. Checking an exception-throwing instruction against the code ranges tracked in the exception tables generated by the high-level compilers may be problematic, because the re-ordered instructions may not be in the original code range as shown in the exception tables. In contrast, the instruction sequence generated by a CPU compiler is final and checking a given instruction against code ranges in exception tables is not an issue.
A method is described for processing a function in source code by a compiler for execution on a graphics processing unit, wherein the function includes an exception handling structure. The method includes converting an exception raising block into a first control flow and converting an exception handler block into a second control flow. The first control flow includes setting an exception raised indicator and finding an exception handler to process the raised exception. The second control flow includes clearing the exception raised indicator and processing the exception. The exception raised indicator remains set until an appropriate exception handler is found.
A system includes a processor and a compiler executed by the processor to perform operations. The operations performed by the compiler include converting an exception raising block into a first control flow and converting an exception handler block into a second control flow. The first control flow includes setting an exception raised indicator and finding an exception handler to process the raised exception. The second control flow includes clearing the exception raised indicator and processing the exception.
A computer-readable storage medium storing a set of instructions for execution by a computer to process a function in source code for execution on a graphics processing unit, wherein the function includes an exception handling structure. The set of instructions includes a first converting code segment for converting an exception raising block into a first control flow and a second converting code segment for converting an exception handler block into a second control flow. The first control flow includes setting an exception raised indicator and finding an exception handler to process the raised exception. The second control flow includes clearing the exception raised indicator and processing the exception.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
A function in source code is processed by a compiler for execution on a graphics processing unit, wherein the function includes an exception handling structure. An exception raising block is converted into a first control flow and an exception handler block is converted into a second control flow. The first control flow includes setting an exception raised indicator and finding an exception handler to process the raised exception. The exception raised indicator remains set until an appropriate exception handler is found. The second control flow includes clearing the exception raised indicator and processing the exception.
The processor 102 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 104 may be located on the same die as the processor 102, or may be located separately from the processor 102. The memory 104 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 106 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 108 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. It is noted that the input driver 112 and the output driver 114 are optional components, and that the device 100 will operate in the same manner if the input driver 112 and the output driver 114 are not present.
To be an acceptable solution to support EH on a GPU, the solution has to (1) allow excepting and non-excepting work items (i.e., threads) to join their execution at each function boundary (due to the GPU hardware limitations), and (2) have minimal performance overhead when no exceptions are thrown, because the CPU zero-cost case described above cannot be achieved on a GPU.
In this approach, a high-level C++ compiler transforms throw clauses and the functions in a try block that may throw exceptions into a sequence of control flows, which compare the exception object type against each candidate exception handler. If there is a match, a branch instruction jumps to the matched handler to handle the thrown exception and then resumes normal execution. The sequence of checking candidate exception handlers traverses enclosing try blocks and their associated exception handlers from inner to outer scopes. If the exception object type is known at compilation time, the compiler may simplify the control flows and jump directly to the corresponding handler. It is noted that the compiler also has to generate code to destruct live objects local to the scope that is being exited.
Current GPUs are already capable of dealing with thread divergence under arbitrary control flows expressed in FSAIL instructions within a given function. The JIT compiler translates FSAIL branch instructions into predicated code or special native conditional branch and join instructions. Both excepting threads (those threads that throw an exception) and non-excepting threads have to join back together at a function boundary. When execution reaches a function boundary, the execution waits and does not begin to unwind the stack if there are still exceptions to be handled. The execution waits for non-excepting threads to execute and return (i.e., complete execution). If and only if both the excepting threads and the non-excepting threads reach the return point (e.g., the end of the function), both will return back to the caller. This restriction on returning back to the caller is imposed because of the nature of SIMD execution.
When the excepting thread returns and has not found an exception handler for the exception, the thread needs to continue to look for an appropriate exception handler. The execution flow returns to the caller, and the excepting and non-excepting threads may diverge again. Branches are used to lead the diverging thread forward. To allow an excepting work-item whose exception has not been handled upon a function return to continue looking for a handler instead of executing the normal code paths as the non-excepting work-items do, a reserved FSAIL variable “private_b8 hasexceptionhappened” is defined. This variable (referred to generally herein as an “exception flag”) has a global scope but is unique to each work-item, and it is allocated outside of any function. The convention is to set the exception flag as soon as an exception is raised, and reset the exception flag only after the exception is handled. Upon the return from a call, a work-item needs to check the value of the exception flag. If the flag is set, this work-item needs to follow a code path divergent from the non-excepting work-items to continue searching for a proper handler. The JIT compiler has to recognize this special variable and map it to a fixed memory location (or possibly a register, as an optimization). Because the C++ EH specification does not allow multiple outstanding exceptions in each thread, a single variable is sufficient for each work-item. This provides the appearance of no thread divergence across function boundaries.
The costs of checking exception types against exception handlers and performing control flow transfers are incurred only when an exception is thrown, except for the case that the exception flag is checked upon a function return even if no exceptions have been thrown. This is an artifact of not allowing thread divergence across function boundaries. This comparison is an unavoidable but acceptable overhead, because functions may continue to be aggressively inlined for GPU offload functions. If a C++ high-level compiler is informed through a compilation option that no exceptions are thrown in an application, it does not need to generate the code to check the exception flag.
An implementation in a C++ high-level compiler is described in the following pseudo code. The C++ code is processed, and additional structure and mechanisms are added to the code to perform the EH on the GPU. The resulting code follows the same semantics as the original source code.
This implementation uses a global variable (the exception flag) to indicate that a thread has thrown an exception. In the examples below, this variable is referred to as hasexceptionhappened. It is noted that a person skilled in the art could devise other ways of tracking whether any thread has thrown an exception (e.g., an exception raised indicator), without altering the overall operation of the method. Once an appropriate exception handler has been found, the exception handler resets this variable to indicate that the exception has been handled, and the thread can resume normal execution upon returning to the calling function. The variable will remain set (indicating that there is an exception that has not yet been handled) upon the excepting thread returning to the caller, unless the exception handler routine resets it.
The method 200 processes all of the try blocks in the current function in a lexical and outer to inner order. A current try block is selected and processed (step 204). The try block processing includes adding a “join label” at the end of the current try block, which is used as an exit point for the current try block.
Each catch clause in the current try block is processed (step 206). This catch clause processing will be described in greater detail in connection with
As part of processing each throw clause or function call in the current try block, any other try blocks contained within the current try block (referred to as “enclosing try blocks”) are visited in an inner to outer order (step 210). Destructor calls are added to a landing pad block associated with the enclosing try block for currently live objects that are local to the enclosing try block being evaluated.
The landing pads as used herein follow the concept from the CPU side, in that they are convenient locations for common branches to go to if an appropriate exception handler for the exception object type cannot be found. The landing pad acts as a placeholder to call a destructor for live objects that are local to the current function (because the function is being exited, this is part of the necessary clean up). After this cleaning up of the function is complete, the next “outer” enclosing scope is checked for an appropriate exception handler for the exception object type. If an appropriate exception handler is not found as the code moves back up the layers of function calls, the landing pads are used at each layer where an appropriate exception handler is not found.
When performing EH on a CPU, the branches are not explicit (e.g., not directly to a landing pad). In a CPU implementation, an EH routine performs a lookup in a table, and if there is no match in the table, then the landing pad is used. This implementation defers a high penalty to the time when an exception happens, meaning that the implementation is structured such that when there is no exception (which is the normal case), the code executes efficiently, without table lookups, etc. But in a GPU implementation, the execution flow is streamlined by using conditional branches, rather than indirect branches through lookup tables, to jump to the landing pads. This is due to the nature of the SIMD design of a GPU, in which the efficiencies realized in a CPU implementation cannot be utilized.
Referring back to
After visiting all of the enclosing try blocks within the current try block, the found handler flag at the current try block level is checked (step 216). This process will be described in greater detail in
The method 200 only imposes a low performance overhead on non-excepting execution paths. A small amount of overhead is added after each function return to check for excepting threads, but the method does not add any other execution overhead if no exceptions occur. While this approach adds a slight overhead compared to the “zero-cost” approach on CPUs, it is more efficient than previous approaches such as using setjmp/longjmp instructions.
The method 200 does not rely on any handshake between the FSAIL instructions and the exception tables generated by a high-level compiler as in the CPU “zero-cost” approach. Because the JIT compiler may expand the FSAIL instructions and alter the instruction order, such a handshake is challenging to maintain correctly in the GPU tool chains, where the JIT compiler is an essential component.
The following is an example which applies this approach in translating the C++ EH constructs into control flows on the current GPU architectures.
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements.
The methods provided may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the present invention.
The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).