This application claims priority to Taiwan Application Serial Number 109146968, filed on Dec. 30, 2020, which is herein incorporated by reference in its entirety.
The present disclosure relates to the field of a compiler, and more particularly to a compiler adapted in a graphics processing unit.
In recent years, with the rise of Internet of things (IOT) and the rapid development of artificial intelligence, machine learning and other fields, the amount of data processing has increased significantly. The traditional cloud computing has been unable to cope with such a large amount of real-time data processing, and thus has been replaced by the application architecture of distributed computing (e.g., fog computing, edge computing, end user computing). For example, edge computing moves computations of applications, data, and services from the central node of the network to the logical edge node of the network for processing. In other words, edge computing decomposes the large-scale services that were originally processed by the central node into small and many manageable parts, and distributes them to the edge nodes for processing. The edge node is close to the user terminal device, thus speeding up the data processing and transmission and reducing the delay.
Therefore, a general-purpose graphics processing unit (GPGPU) has been widely used in such applications that need to compute a large amount of data and can be highly parallelized. In addition to processing graphics data, such a graphics processing unit can also be used to perform the general computing tasks that were originally processed by a CPU and were generally not associated with graphics processing. Due to the powerful parallel processing capability and programmable pipelines of the modern graphics processing unit, the performance of the GPGPU can greatly surpass that of the traditional CPU for single instruction multiple data (SIMD) processing, on the condition that the computation of data processing is much larger than that of data scheduling and transmission.
However, most GPUs rely on their manufacturers' own system architectures and compilers, which usually only support applications written for the architectures and languages those manufacturers define. Even if these manufacturers have released some support services for open-source software, compilers and other related software or hardware still have to use their definitions. For example, the traditional open computing language (OpenCL) compiler is AMD CLOC, which is closed source software and is only provided for the X86 platform. In other words, developers are unable to modify it, add instructions to it, or optimize it, which makes development and use difficult. Therefore, how to provide a portable OpenCL compiler platform and an optimized compiler to improve the performance of graphics processors supporting OpenCL is a current topic.
One of the objectives of the present disclosure is to provide a compiler adapted in a graphics processing unit and a non-transitory computer-readable storage medium.
To achieve the aforementioned objectives, the present disclosure provides a compiler adapted in a graphics processing unit for general purpose, which is configured to compile an application program executed by the graphics processing unit to generate a machine code corresponding to the application program for execution by a plurality of stream multiprocessors of the graphics processing unit. The compiler includes a front-end module, an optimization module, and a back-end module. The front-end module is configured to perform a pre-processing on a source code corresponding to the application program to generate an intermediate code. The optimization module is configured to perform an optimization processing on the intermediate code. The back-end module is configured to perform a translation processing on the optimized intermediate code to generate the machine code. The optimization processing includes translating each branch instruction in the intermediate code into performing the following operations: establishing a post dominator tree for the branch instruction to find an immediate post dominator of the branch instruction serving as a reconverge point of instructions of a first path and a second path of the branch instruction; and inserting a specific instruction at the front end of the reconverge point, so as to jump to execute the instructions of the second path of the branch instruction when the instructions of the first path of the branch instruction are executed, once the specific instruction on the first path is executed, wherein the instructions following the reconverge point are not executed until the specific instruction on the second path is executed.
In one embodiment of the present disclosure, the branch instruction is executed simultaneously by a plurality of stream processors included in the one of the stream multiprocessors to which the branch instruction is issued, wherein the instructions of the first path are executed by a plurality of first stream processors and a plurality of second stream processors of the stream processors simultaneously by using a first lane mask, and the instructions of the second path are executed by the first stream processors and the second stream processors simultaneously by using a second lane mask.
In one embodiment of the present disclosure, once the specific instruction on the first path is executed, only the results of the execution by the first stream processors are stored, and once the specific instruction on the second path is executed, only the results of the execution by the second stream processors are stored.
In one embodiment of the present disclosure, when the instructions of the first path of the branch instruction are executed, once the specific instruction is executed, the use of the first lane mask is ended; and when the instructions of the second path of the branch instruction are executed, once the specific instruction is executed, the use of the second lane mask is ended.
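The lane-mask behavior described in the preceding embodiments can be illustrated with a minimal Python sketch. It is a software model only, not the disclosed hardware or compiler: the function name `run_branch`, the even/odd condition, and the two arithmetic paths are assumptions chosen for illustration. Every lane computes both paths, but only the lanes selected by the active mask keep their results.

```python
def run_branch(values):
    """Model one warp executing an if/else under two lane masks.

    Hypothetical example: lanes holding an even value take the first path
    (multiply by 2), the remaining lanes take the second path (add 1).
    """
    n = len(values)
    first_mask = [v % 2 == 0 for v in values]   # lanes of the first path
    second_mask = [not m for m in first_mask]    # lanes of the second path
    out = [None] * n

    # First path: all lanes execute, only lanes in first_mask store results.
    first_results = [v * 2 for v in values]
    for lane in range(n):
        if first_mask[lane]:
            out[lane] = first_results[lane]
    # The specific instruction on the first path is reached here: the first
    # lane mask is dropped and execution jumps to the second path.

    # Second path: all lanes execute, only lanes in second_mask store results.
    second_results = [v + 1 for v in values]
    for lane in range(n):
        if second_mask[lane]:
            out[lane] = second_results[lane]
    # The specific instruction on the second path is reached here: the second
    # lane mask is dropped and the lanes reconverge.
    return out


print(run_branch([1, 2, 3, 4]))  # [2, 4, 4, 8]
```

In this model the code after the second loop plays the role of the instructions following the reconverge point: it runs only after both paths have finished.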
In one embodiment of the present disclosure, the optimization processing further includes translating each call instruction in the intermediate code into performing the following operation: inlining all contents of the function called by the call instruction directly in the caller using the call instruction.
In one embodiment of the present disclosure, the optimization processing further includes translating each loop instruction in the intermediate code into performing the following operations: analyzing the number of the loops for the loop instruction; and unrolling all instructions executed in the loop instruction according to the number of the loops.
In one embodiment of the present disclosure, the front-end module is a clang compiler, which is configured to generate the intermediate code defined by a low level virtual machine (LLVM).
In one embodiment of the present disclosure, the pre-processing includes macro processing, static analysis, and generating a syntax tree corresponding to the source code.
The present disclosure further provides a non-transitory computer-readable storage medium, which is configured to store a plurality of instructions, wherein the instructions are executed by a processor in a computer system so that the processor executes a compiling method to compile an application program executed by a graphics processing unit in the computer system to generate a machine code corresponding to the application program for execution by a plurality of stream multiprocessors of the graphics processing unit, wherein the compiling method includes: performing a pre-processing on a source code corresponding to the application program to generate an intermediate code; performing an optimization processing on the intermediate code; and performing a translation processing on the optimized intermediate code to generate the machine code; wherein the optimization processing includes translating each branch instruction in the intermediate code into performing the following operations: establishing a post dominator tree for the branch instruction to find an immediate post dominator of the branch instruction serving as a reconverge point of instructions of a first path and a second path of the branch instruction; and inserting a specific instruction at the front end of the reconverge point, so as to jump to execute the instructions of the second path of the branch instruction when the instructions of the first path of the branch instruction are executed, once the specific instruction on the first path is executed, wherein the instructions following the reconverge point are not executed until the specific instruction on the second path is executed.
In the present disclosure, by optimizing the aforementioned branch-related instruction, call instruction and loop instruction in the compiling process, the software stack can effectively match the operation of the hardware, and greatly improve the overall performance, so as to provide a convenient open-source execution environment for developers.
Reference will now be made in detail to embodiments of the present disclosure, examples of which are described herein and illustrated in the accompanying drawings.
Reference is made to
A thread is the smallest unit of a program executed by the GPU 100, and its scheduling is issued through two different scheduling modules, namely, the work group scheduling module 130 and the warp scheduling module 121. When the CPU issues a new work, the work group scheduling module 130 receives the program to be executed in the unit of thread grid, cuts and schedules it, and then issues it to each stream multiprocessor 120 in the unit of thread block for execution. After receiving a thread block, a stream multiprocessor 120 divides the thread block into multiple warps according to the width of single instruction multiple data (SIMD), and performs computations in the unit of the warp. Multiple warps are scheduled by the warp scheduling module 121 and issued to each stream processor 122 for execution. Multiple threads in the same warp are executed simultaneously by the stream processors 122 of the stream multiprocessor 120. For example, if the stream multiprocessor 120 includes 32 stream processors 122 (i.e., the width of SIMD is 32), each warp holds up to 32 threads, which are executed by the 32 stream processors 122 in parallel at the same time. If fewer than 32 threads are in one warp, the stream processors 122 without a corresponding thread do not work at the moment. It should be understood that the program running on a graphics processing unit is generally called a kernel, and a kernel corresponds to a thread grid. Each thread grid includes multiple thread blocks, and each thread block includes multiple threads.
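As a rough illustration of how a thread block is divided into warps, the following Python sketch splits a block by the SIMD width and records which lanes of each warp hold a thread. The function name `split_into_warps` and the bit-mask representation are assumptions made for this example and are not taken from the disclosure.

```python
def split_into_warps(block_size, simd_width=32):
    """Return one (first_thread_id, lane_mask) pair per warp.

    Bit i of lane_mask is set when lane i of the warp has a thread to run.
    A partially filled last warp leaves its upper lanes idle.
    """
    warps = []
    for first in range(0, block_size, simd_width):
        active = min(simd_width, block_size - first)
        lane_mask = (1 << active) - 1
        warps.append((first, lane_mask))
    return warps


# A thread block of 70 threads yields two full warps and one warp of 6 threads.
for first, mask in split_into_warps(70):
    print(f"warp starting at thread {first}: {bin(mask).count('1')} active lanes")
```

With a SIMD width of 32, the third warp in this example keeps 26 stream processors idle, which matches the case of fewer than 32 threads in one warp described above.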
Reference is made to
However, if the software level of the GPGPU 100 is not supported by the compiler, the whole system platform of the GPGPU 100 is unable to be established completely. Therefore, the compiler plays a very important role in the whole software and hardware system. In the present disclosure, the compiler 240 is an OpenCL LLVM compiler to support the GPGPU 100. The compiler 240 can optimize and customize its own instruction set, so as to achieve an effective cooperation between hardware and software and further improve the execution efficiency.
Specifically, for the TensorFlow runtime 210, in order to enable the TensorFlow applications to be executed under the OpenCL architecture, first it is necessary to understand the collocation scheme of the TensorFlow stream executor and the TF-Coriander. The TensorFlow stream executor is a common interface of the kernel application interface defined by Google for TensorFlow. In the architecture concept, the stream executor is used as the hardware abstraction layer of each target platform. The upper kernel application may perform the commands related to the resource management, such as memory allocation, instruction issue, and kernel process monitoring, on the virtual device through the common interface. Each platform developer can also put platform-related optimization programs into kernel implementation to optimize the execution efficiency of each kernel on the platform.
The native TensorFlow GPU support only covers GPU devices that use the CUDA programming language. For other platforms, developers need to design their own stream executors for the target platform. Since TensorFlow provides many kinds of kernel operations, considerable manpower may be required to provide complete support for a platform, and keeping that support synchronized with TensorFlow as it is updated may be difficult. In order to reduce the complexity of supporting new hardware, a CUDA-on-CL architecture is proposed, which uses Coriander's source-to-source compiler to translate the native CUDA application program into host code and device code that are able to be executed by an OpenCL device, so as to convert the native CUDA code of TensorFlow into OpenCL device kernels and design a stream executor for OpenCL. The result is an independent branch of TensorFlow, that is, TF-Coriander.
TF-Coriander translates the CUDA code built in TensorFlow into OpenCL device kernel code through the Coriander compiler, uses OpenCL libraries, such as CLBlast and DNN, to substitute for cuBLAS and cuDNN in CUDA, and builds a version of TensorFlow that supports OpenCL 1.2 devices.
In addition, for the OpenCL runtime, the modern computation platform is generally composed of heterogeneous hardware such as CPU, GPU or ASIC. Therefore, Apple proposes an open source language framework, that is, Open Computing Language (OpenCL). OpenCL provides a unified abstract software architecture and language for different hardware architectures, and uses the same application interface to connect with the target hardware for providing functions, such as device memory allocation, device kernel compilation and device code dispatching. In order to support the hardware of each platform, the OpenCL runtime is implemented in the form of a shared library (Linux) or a dynamically loadable library (NT) in the software architecture. Each hardware developer may implement the application program interface for its hardware according to the OpenCL specification.
OpenCL application architecture divides code into the host code and the device code (kernel). Most of the content executed by the host code is composed of C++ classes and runtime API provided by the OpenCL runtime. For the GPU/accelerator and other target devices, the OpenCL kernel code needs to be written separately, and the design for the dispatched kernel complies with OpenCL programming mode. OpenCL kernel code is a programming language based on C99, which provides parallel computing capability of task partition/data partition with kernel application program interface.
For the HSA runtime 230, in order to integrate hardware platforms with different architectures such as CPU, GPU, and DSP, HSA Foundation proposes a software architecture of heterogeneous system architecture (HSA). Similar to OpenCL, which provides a common parallel computing software development framework, HSA aims to provide a common hardware interface. Unlike OpenCL, which standardizes a unified application program development interface, HSA standardizes a unified hardware operation interface to simplify the development complexity of bridging interface between the upper layer (e.g., OpenCL) and the lower layer.
In the present embodiment, in order to provide the special computation instructions supported by the OpenCL kernel application and the GPGPU 100, it is necessary to additionally provide a device library 250 to cooperate with the compiler 240. The device library 250 includes an OCKL module 251, an OCML module 252, and an OpenCL module 253. The OCKL module 251 is configured to provide an application program interface including the related parameters (e.g., work item ID, thread block size, thread grid size, etc.) required by running the kernel. The OCML module 252 is configured to provide an application program interface related to mathematical calculations. The OpenCL module 253 is configured to provide an OpenCL kernel application interface corresponding to the functions of the OCKL module 251 and the OCML module 252. Through the device library 250, the compiler 240 can provide the resources related to the OpenCL kernel application interface for developers to use its internal special operation instruction set.
Reference is made to
In the present embodiment, the compiler 240 uses LLVM (low level virtual machine) architecture as the development platform. LLVM takes componentization as the design goal in the compiler architecture design, and divides each compiling function into individual corresponding sub modules. As a result, the core components of the compiler are able to be shared between different languages and different target architectures, in which the intermediate data transmission mechanism adopts the intermediate language defined by LLVM (LLVM-IR), which is a high level abstract intermediate code not associated with the platform and is able to be used by the front-end module 310 and the back-end module 330.
Specifically, the front-end module 310 is responsible for language-related processing. For example, the front-end module 310 can translate the source code to generate the internal-required abstract syntax tree (AST) data structure, pre-process the source code, and then translate the processed source code to generate the aforementioned LLVM-IR for the back-end module 330 to process. The pre-processing may include macro processing, static analysis, and so on. Macro processing includes the functions related to language specification, such as item expansion, constant term processing, and so on. Static analysis is to analyze the characteristics of the code, such as program size, the use of variables, program complexity, performance, and so on.
In the present embodiment, the front-end module 310 may be a Clang compiler to generate the corresponding LLVM-IR. In one embodiment, Clang can first perform the aforementioned pre-processing on the source code, and then translate the source code into the syntax tree defined by Clang (Clang AST) through token based Parser. After generating Clang AST, Clang can perform the language-related optimization on it and transform it into LLVM-IR.
The optimization module 320 can optimize LLVM-IR, such as constant pre-processing, conditional optimization and other language-dependent optimizations.
The back-end module 330 is configured to integrate the LLVM-IR instructions generated by the front-end module 310 and the optimization module 320, and generate the target-executable instructions and file formats. In other words, the back-end module 330 can translate the LLVM-IR to generate the machine code/file executable by the stream multiprocessor 120 of the GPGPU 100.
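The division of work among the three modules can be sketched with a toy pipeline in Python. This is only an analogy under stated assumptions: Python's standard `ast` module stands in for the front end, a constant-folding pass stands in for the optimization module, and a hypothetical stack-machine instruction set (`PUSH`, `LOAD`, `ADD`, `MUL`) stands in for the machine code. None of it is Clang, LLVM-IR, or the instruction set of the GPGPU 100.

```python
import ast


def front_end(source):
    """Parse an arithmetic expression into a syntax tree."""
    return ast.parse(source, mode="eval").body


def optimize(node):
    """Fold sub-expressions whose operands are both constants."""
    if isinstance(node, ast.BinOp):
        left, right = optimize(node.left), optimize(node.right)
        if isinstance(left, ast.Constant) and isinstance(right, ast.Constant):
            if isinstance(node.op, ast.Add):
                return ast.Constant(value=left.value + right.value)
            if isinstance(node.op, ast.Mult):
                return ast.Constant(value=left.value * right.value)
        return ast.BinOp(left=left, op=node.op, right=right)
    return node


def back_end(node):
    """Translate the optimized tree into hypothetical stack-machine code."""
    if isinstance(node, ast.Constant):
        return [f"PUSH {node.value}"]
    if isinstance(node, ast.Name):
        return [f"LOAD {node.id}"]
    if isinstance(node, ast.BinOp):
        opcode = {ast.Add: "ADD", ast.Mult: "MUL"}[type(node.op)]
        return back_end(node.left) + back_end(node.right) + [opcode]
    raise NotImplementedError(type(node).__name__)


def compile_expr(source):
    return back_end(optimize(front_end(source)))


print(compile_expr("x * (2 + 3)"))  # ['LOAD x', 'PUSH 5', 'MUL']
```

Because each stage only consumes the output of the previous one, a different front end or back end could be substituted without touching the optimization pass, which is the componentization property attributed to LLVM above.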
In the present disclosure, for some instructions included in the intermediate code (i.e., LLVM-IR), the optimization module 320 of the compiler 240 will further perform an optimization processing on them, as described below.
In one embodiment, when the intermediate code includes a “branch” instruction, the optimization module 320 can perform the optimization processing on it to translate it into the corresponding machine code performing the following operations: establishing a post dominator tree on the branch instruction to find an immediate post dominator of the branch instruction as the reconverge point of the instructions of the first path and the instructions of the second path of the branch instruction; and inserting a specific instruction (e.g., jump instruction) at the front end of the reconverge point, so as to jump to execute the instructions of the second path of the branch instruction rather than continuing to execute the remaining instructions following the reconverge point when the instructions of the first path of the branch instruction are executed, once the specific instruction on the first path is executed, wherein the remaining instructions following the reconverge point are not executed until the specific instruction on the second path is executed.
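Finding the reconverge point can be sketched in Python with a standard iterative post-dominator computation over a control flow graph. The graph encoding, the block names, and the helper names `post_dominators` and `immediate_post_dominator` are assumptions for illustration; the disclosure does not specify how the post dominator tree is built, and the sketch assumes every block can reach a single exit block.

```python
def post_dominators(cfg, exit_node):
    """Return, for each block, the set of blocks that post-dominate it.

    cfg maps a block name to the list of its successor block names.
    """
    nodes = set(cfg)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # A block is post-dominated by itself and by whatever
            # post-dominates all of its successors.
            common = set.intersection(*(pdom[s] for s in cfg[n]))
            new = {n} | common
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominator(cfg, exit_node, node):
    """Return the closest strict post-dominator of node."""
    pdom = post_dominators(cfg, exit_node)
    strict = pdom[node] - {node}
    for candidate in strict:
        # The closest one is post-dominated by all the other strict ones.
        if pdom[candidate] == strict:
            return candidate
    return None


# Hypothetical if/else diamond: both paths meet again at "merge".
cfg = {
    "entry": ["branch"],
    "branch": ["then", "else"],
    "then": ["merge"],
    "else": ["merge"],
    "merge": ["exit"],
    "exit": [],
}
print(immediate_post_dominator(cfg, "exit", "branch"))  # merge
```

In this example the block `merge` is the reconverge point, so the specific instruction would be placed at its front end, where both the first path and the second path arrive.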
Reference is made to
Take the branch instruction 400 in
In the example of
Reference is made to
In one embodiment, when the intermediate code includes a “call” instruction, the optimization module 320 can perform the optimization processing on the call instruction to translate it into the corresponding machine code to perform the following operation: inlining all the contents of the function called by the call instruction directly in the caller using the call instruction. Since the call instruction results in a complex divergence problem, the hardware cost is increased, and the efficiency is diminished. Therefore, when the compiler 240 of the present disclosure processes the call-related instructions, the designated function body may be directly inserted into and replace every place where the function is called, that is, the contents of the called function may be directly inlined inside the caller, so as to avoid the divergence and save the extra time incurred by each function call.
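The effect of inlining can be shown with a small Python sketch over a toy instruction list. The tuple-based instruction format, the function names in the example, and the helper `inline_calls` are hypothetical; argument passing and return values are left out, and the sketch assumes the called functions are not recursive.

```python
def inline_calls(functions, name):
    """Return the body of `name` with every call replaced by the callee body.

    functions maps a function name to a list of instruction tuples, where a
    call is written as ("call", callee_name).
    """
    body = []
    for instr in functions[name]:
        if instr[0] == "call":
            # Splice the (recursively inlined) callee body in place of the call.
            body.extend(inline_calls(functions, instr[1]))
        else:
            body.append(instr)
    return body


functions = {
    "square_add": [("mul", "x", "x"), ("add", "x", "1")],
    "kernel": [("load", "x"), ("call", "square_add"), ("store", "x")],
}
print(inline_calls(functions, "kernel"))
# [('load', 'x'), ('mul', 'x', 'x'), ('add', 'x', '1'), ('store', 'x')]
```

After the pass, the kernel body contains no call instruction, so no jump to and return from a separate function remains to be executed.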
In one embodiment, when the intermediate code includes a “loop” instruction (e.g., “loop” instruction, “for” instruction, etc.), the optimization module 320 can perform the optimization processing on the loop instruction to translate it into the corresponding machine code to perform the following operations: analyzing the number of the loops for the loop instruction; and unrolling all the instructions executed in the loop instruction according to the number of the loops. A loop relies on a branch instruction to jump back to the start of the loop body, and the branch instruction results in divergence. Therefore, the stream multiprocessor blocks the dispatch of all instructions following the branch instruction when facing the branch instruction. The stream multiprocessor does not execute the branch instruction until the instructions in the pipeline are all completed, and does not continue to dispatch the following instructions until jumping to the designated target, which decreases pipeline efficiency. In order to reduce the number of branch instructions executed, the loop unrolling method is used in the present embodiment to unroll all instructions in the loop instruction by the number of the loops thereof on the condition of available resources, so as to reduce the proportion of the branch instructions in the loop instruction during execution.
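A minimal Python sketch of full loop unrolling over a toy instruction list follows. The instruction format `("loop", variable, trip_count, body)`, the `%i` placeholder, and the helper `unroll` are assumptions for illustration; the sketch presumes the number of loops is already known from analysis and ignores the resource limits mentioned above.

```python
def unroll(program):
    """Replace each loop with trip_count copies of its body.

    Operands are strings; each occurrence of the loop variable in an operand
    is substituted with the iteration index of the copy.
    """
    out = []
    for instr in program:
        if instr[0] == "loop":
            _, var, trip_count, body = instr
            inner = unroll(body)  # unroll nested loops first
            for i in range(trip_count):
                for op in inner:
                    out.append(tuple(field.replace(var, str(i)) for field in op))
        else:
            out.append(instr)
    return out


program = [
    ("loop", "%i", 3, [("load", "a[%i]"), ("add", "acc", "a[%i]")]),
    ("store", "acc"),
]
for instr in unroll(program):
    print(instr)
```

The unrolled program consists of three load and add pairs followed by the store, with no loop or branch instruction left to stall the dispatch of the following instructions.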
To sum up, the general graphics processing unit provided by the present disclosure designs the runtime for the graphics processing unit and the corresponding OpenCL LLVM compiler according to the OpenCL specification, so as to provide an application program interface conforming to and supporting OpenCL/TensorFlow. Moreover, by optimizing the aforementioned branch-related instructions, call instructions and loop instructions in the compiling process, the software stack can better match the operation of the hardware, and greatly improve the overall performance, so as to provide a convenient open-source execution environment for developers.
Although the present disclosure has been disclosed by way of preferred embodiments, the above preferred embodiments are not intended to limit the present disclosure, and one of ordinary skill in the art may make various modifications and variations without departing from the spirit and scope of the invention. Therefore, the scope of protection of the present disclosure is defined by the scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
109146968 | Dec 2020 | TW | national |