Contained herein is material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
The present disclosure relates generally to compilation of computation tasks for heterogeneous multiprocessor systems.
A compiler translates a computer program written in a high-level language, such as C, C++, or FORTRAN, into machine language. The compiler takes the high-level code for the computer program as input and generates a machine-executable binary file that includes machine language instructions for the target hardware of the processing system on which the computer program is to be executed.
The compiler may include logic to generate instructions to perform software-based prefetching. Software prefetching masks memory access latency by issuing a memory request before the requested value is used. While the value is retrieved from memory, which can take 300 cycles or more, the processor can execute other instructions, effectively hiding the memory access latency.
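By way of illustration only, the following minimal C sketch shows how a hand-written software prefetch of this kind might look, assuming a GCC-style __builtin_prefetch intrinsic; the routine name, array arguments, and prefetch distance are hypothetical and are used only to illustrate the general technique.

    /* Sum the elements of a large array, requesting a later element on each
     * iteration so that its load latency overlaps useful computation.      */
    double sum_with_prefetch(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            /* Issue the memory request well before the value is used; the
             * distance of 16 elements is an illustrative tuning value.     */
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);
            sum += a[i];   /* useful work proceeds while the prefetch is in flight */
        }
        return sum;
    }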
A heterogeneous multi-processor system may include one or more general purpose central processing units (CPUs) as well as one or more of the following additional processing elements: specialized accelerators, digital signal processor(s) (“DSPs”), graphics processing unit(s) (“GPUs”) and/or reconfigurable logic element(s) (such as field programmable gate arrays, or FPGAs).
In some known systems, the coupling of the general purpose CPU with the additional processing element(s) is a “loose” coupling within the computing system. That is, the integration of the system is on a platform level only, such that the software and compiler for the CPU are developed independently from the software and compiler for the additional processing element(s). Typically, the programming model and methodology for the CPU and the additional processing element(s) are quite distinct. Different programming models, such as C++ vs. DirectX, may be used, as well as different development tools from different vendors, different programming languages, etc.
In such cases, communication between the various software components of the system may be performed via heavyweight hardware and software mechanisms using special hardware infrastructure such as, e.g., a PCIe bus and/or OS support via device drivers. Such an approach presents limitations when it is desired, from an application development point of view, to treat the CPU and one or more of the additional processing element(s) as one integrated processor entity (e.g., tightly coupled co-processors) for which a single computer program is to be developed. Such an approach is sometimes referred to as a “heterogeneous programming model”.
Embodiments provide a compiler for a heterogeneous programming model for a heterogeneous multi-processor system. A compiler generates machine code that includes prefetching and/or scheduling optimizations for code to be executed on a first processing element (such as, e.g., a CPU) and one or more additional processing element(s) (such as, e.g., a GPU) of a heterogeneous multi-processor system. Although presented below in the context of heterogeneous multi-processor systems, the apparatus, system and method embodiments described herein may be utilized with homogeneous or asymmetric multi-core systems as well.
Although specific sample embodiments herein are presented in the context of a computing system having one or more CPUs and one or more graphics co-processors, such illustrative embodiments should not be taken to be limiting. Alternative embodiments may include other additional processing elements instead of, or in addition to, graphics co-processors (also sometimes referred to herein as “GPUs”). Such other additional processing elements may include any processing element that can execute a stream of instructions (such as, for example, a computation engine, a digital signal processor, an acceleration co-processor, etc.).
In the following description, numerous specific details such as system configurations, particular order of operations for method processing, specific examples of heterogeneous systems, pseudo-code examples of source code and compiled code, and implementation details for embodiments of compilers and library routines have been set forth to provide a more thorough understanding of embodiments of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the present invention.
The general purpose processors 200_0-200_n of the target hardware system 140 may include multiple homogeneous processors having the same instruction set architecture (ISA) and functionality. Each of the processors 200 may include one or more processor cores.
For at least one other embodiment, however, at least one of the CPU processing units 200_0-200_n may be heterogeneous with respect to one or more of the other CPU processing units 200_0-200_n of the target hardware system 140. For such embodiment, the processor cores 200 of the target hardware system 140 may vary from one another in terms of ISA, functionality, performance, energy efficiency, architectural design, size, footprint or other design or performance metrics. For at least one other embodiment, the processor cores 200 of the target hardware system 140 may have the same ISA but may vary from one another in other design or functionality aspects, such as cache size or clock speed.
Other processing unit(s) 220 of the target hardware system 140 may feature ISAs and functionality that differ significantly from those of the general purpose processing units 200. These other processing units 220 may optionally include, as shown in
For one example embodiment, which in no way should be taken to be an exclusive or exhaustive example, the target hardware system 140 may include one or more general purpose central processing units (“CPUs”) 200_0-200_n along with one or more graphics processing unit(s) (“GPUs”) 220_0-220_n. Again, for embodiments that optionally include multiple GPUs, additional such units 220_1-220_n are denoted in
As indicated above, the target hardware system 140 may include various types of additional processing elements 220 and is not limited to GPUs. Any additional processing element 220 that has characteristics of high parallel computing capabilities (such as, for example, a computation engine, a digital signal processor, an acceleration co-processor, etc.) may be included, in addition to the one or more CPUs 200_0-200_n of the target hardware system 140. For instance, for at least one other example embodiment, the target hardware system 140 may include one or more reconfigurable logic elements 220, such as a field programmable gate array. Other types of processing units and/or logic elements 220 may also be included for embodiments of the target hardware system 140.
The memory storage elements 210_0-210_n, 230_0-230_n may be implemented in any known manner. One or more of the elements 210_0-210_n, 230_0-230_n may, for example, be implemented as a memory hierarchy that includes one or more levels of on-chip cache as well as off-chip memory. Also, one of skill in the art will recognize that the illustrated memory storage elements 210_0-210_n, 230_0-230_n, though illustrated as separate elements, may be implemented as logically partitioned portions of one or more shared physical memory storage elements.
It should be noted, however, that whatever the physical implementation, it is anticipated for at least one embodiment that the memory storage elements 210 of the one or more CPUs 200 are not shared by the GPUs (see, e.g., GPU memory 230). For such embodiment, the CPU 200 and GPU 220 processing elements do not share virtual memory address space. (See further discussion below of the transport layer 904 for the transfer of code and data between CPU memory 210 and GPU memory 230.)
For an application development approach that employs a heterogeneous programming model, the various processing elements 200_0-200_n, 220_0-220_n of the target hardware system 140 may be treated as one “super-processor”, with the GPUs 220_0-220_n viewed as co-processors for the one or more CPUs 200_0-200_n of the system 140.
Traditionally, a compiler may invoke GPU-type functions through a GPU library that includes routines, optimized for the architecture of the target hardware system 140, with support for moving data into and out of the GPU. For example, software developers may write library functions that are optimized for the underlying hardware of a GPU co-processor 220. These library functions may include code for complex tasks such as a highly complex matrix multiplication of 10K×10K element matrices, an MP3 decoder for audio streaming, etc. The library code is optimized for the architecture of the GPU co-processor on which it is to be executed. Thus, when a compiled application program is executed on a CPU 200 of such a “super-processor” 140, the compiled code includes a function call to the appropriate library function, thereby “offloading” execution of the complex processing task to the GPU co-processor 220.
A cost associated with this traditional library-based compilation approach is the latency incurred in transferring the data for these complex calculations from the CPU domain (e.g., 930 of
For embodiments of the compiler 120 illustrated in
The compiler can use these techniques to perform latency scheduling optimizations. That is, scheduling can be accomplished by judiciously placing the prefetch instructions into the code stream. In this manner, the compiler can order the instructions so as to allow the CPU to continue processing during the latency associated with loading data or instructions from the CPU to the GPU. One of skill in the art will recognize that this latency avoidance is desirable because the time required to retrieve data from memory is much greater than the execution time of an instruction on a processing unit. For example, an Add or Multiply instruction may take a processing unit only 1-2 cycles to execute, and it may take the processing unit only 1 cycle to retrieve data on a cache hit. But to retrieve data into the memory of the GPU from the CPU, or to retrieve the results back to the CPU from the GPU, may take about 300 cycles. Thus, during the time it takes to load data or instructions into the GPU memory, the CPU could otherwise have performed 300 computations. To alleviate this latency problem, the compiler (e.g., 120 of
A compiler is to compile code written in a particular high-level programming language, such as FORTRAN, C, C++, etc. The compiler is expected to correctly recognize and compile any instructions that are defined in the programming language definition. Any function that is defined by the language specification is referred to as a “predefined” function. An example of a predefined function defined for many high-level programming languages is the cosine function. For this function, when the programmer includes the function in the high-level code, the compiler for the high-level programming language understands exactly how the function is spelled, what the function signature is, and what the function should do. That is, for predefined functions for a particular programming language, the language specification describes in detail the spelling and functionality of the function, and the compiler recognizes and relies on this information. The language specification also defines the data type of the output of the function, so the programmer need not declare the output type for the function in the high-level code. The language specification also defines the data types for the input arguments, and the compiler will automatically flag an error if the programmer has provided an argument of the wrong type. A predefined function will be spelled the same way and work the same way on any standard-conforming compiler for the particular programming language. The compiler may, for example, have an internal table to tell it the correct return types or argument types for the predefined function.
In contrast, a traditional compiler does not have this type of internal information for functions that are not predefined for the particular programming language being used and are, instead, calls to a library function. This type of library function call may be referred to herein as a general purpose library call. For such library function calls, the compiler has no internal table to tell it the correct return types or argument types for the function, nor the correct spelling of the function. In such case, it is up to the programmer to declare the function of the correct type, and to provide arguments of the correct type. As a result, programmer errors for these data types will not be caught by the compiler at compile-time. Also as a result, prefetching optimizations are not performed by the compiler for such general purpose library function calls.
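The distinction may be illustrated with the following short C fragment; the matrix_norm( ) routine is a hypothetical general purpose library function introduced here only for contrast with the predefined cosine function.

    #include <math.h>

    /* A hypothetical general purpose library routine: the programmer-supplied
     * declaration below is all the information the compiler has about it.    */
    extern double matrix_norm(const double *m, int rows, int cols);

    double example(const double *m)
    {
        double x = cos(0.5);             /* predefined function: spelling, argument
                                            types and return type are known to the
                                            compiler, so type errors are caught and
                                            optimizations can be applied            */
        double y = matrix_norm(m, 4, 4); /* general purpose library call: the
                                            compiler must trust the declaration and
                                            performs no prefetch optimization       */
        return x + y;
    }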
We refer briefly back to
In order to achieve this desired result, certain modifications are made to the compiler 120 for one or more embodiments of the present invention. For predefined functions that are to be executed on a CPU, the compiler is aware that a function has an input and an output data set. For these predefined functions, the compiler has innate knowledge of the function and can optimize for it. Such predefined functions are treated by the compiler differently from “general purpose” functions. Because the compiler knows more about the predefined function, the compiler can take that information into account for scheduling and prefetch optimizations during compilation.
The modified compiler 120 takes function calls that might ordinarily be compiled as general purpose library calls for the GPU, and instead treats them like native CPU instructions (so-called “foreign macro-instructions”) in terms of the scheduling and optimizations that the compiler 120 performs. Thus, the compiler 120 illustrated in
The compiler 120, which has been modified to support a heterogeneous compilation model, combines both the CPU machine code stream 330 and the GPU machine code stream 340 into one combined “fat” program image 300. The combined program image 300 includes at least two segments: the segment 330 that includes the compiled code for the regular native CPU code sequences (see, e.g., 301 and 305) and the segment 340 that includes the compiled code for the “foreign” macro-instruction sequences (see, e.g., 302 and 304).
The foreign code sequences are treated by the compiler as if they are extensions to the instruction set of the CPU, so-called “foreign macro-instructions”. Accordingly, the compiler 120 may perform prefetch optimizations for the foreign macro-instructions that would not have been possible if the compiler had compiled the foreign code sequences as general purpose library function calls.
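By way of illustration only, the pseudo-code below sketches how a source program might mark such a foreign code sequence for the compiler by means of a pragma or other compiler directive (discussed further below); the pragma spelling, function names, and data types are purely hypothetical.

    void process_frame(float *frame, float *coeffs, float *out)
    {
        normalize(frame);                  /* native CPU code sequence            */

        /* The hypothetical directive below instructs the compiler to compile the
         * marked call as a foreign macro-instruction to be executed on the GPU,
         * rather than as a general purpose library call.                         */
        #pragma foreign(gpu)
        filter_frame(frame, coeffs, out);  /* foreign code sequence               */

        write_result(out);                 /* native CPU code sequence            */
    }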
At block 408, however, special processing takes place for the foreign code. Responsive to the pragma or other compiler directive, the foreign code is compiled as a foreign macro-instruction. (The processing of block 408 is discussed in further detail below in connection with
From blocks 406 and 408, processing proceeds to block 409. If there are more high-level instructions from the source code 102 to be compiled, processing returns to block 404; otherwise, processing proceeds to block 410.
At block 410, the compiler performs scheduling and/or prefetch optimizations on the code that contains the foreign macro-instructions. The result of block 410 processing is the generation of a single program image 104 similar to the image 300 of
Turning to
The run-time support functions illustrated in
The code-prefetch, data-prefetch and execute functions for the GPU may be implemented in the compiler as macro-instructions that are predefined for the CPU, rather than as general purpose runtime library function calls. They are abstracted to be functionally similar to well-established instructions and functions of the CPU. As a result, the compiler (see, e.g., 120 of
Thus, the compiler operates (see, e.g., block 408 of
For the example pseudocode shown in
The function GPUlaunch( ) is executed by the CPU to cause the macro-instruction code to be executed by the GPU. For the example pseudo-code illustrated in
The function GPUwait( ) is used to sync back (join) the control flow for the CPU. That is, the GPUwait( ) function effects cross-processor communication to let the CPU know that the GPU has completed its work of executing the foreign macro-instruction indicated by a previous GPUlaunch( ) function. The GPUwait( ) function may cause a stall on the CPU side. Such a run-time support function may be inserted by the compiler in the CPU machine code, for example, when no further parallelism can be identified for the section of code 102, such that the CPU needs the results of the GPU operation before it can proceed with further processing.
The functions GPUrelease( ) and GPUfree( ) de-allocate the code and data areas on the GPU. These are housekeeping functions that free up GPU memory. The compiler may insert one or more of these run-time support functions into the CPU code at some point after a GPUinject( ) or GPUload( ) function, respectively, if it appears that the injected code and/or data will not be used in the near future. These housekeeping functions are optional and are not required for proper operation of embodiments of the heterogeneous pre-fetching techniques described herein.
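As a consolidated illustration, the pseudo-code below sketches a sequence that the compiler might emit into the CPU machine code stream for a single foreign macro-instruction, expressed here in C-like form; the handle types, argument lists, and the independent_cpu_work( ) routines are hypothetical and serve only to show how the run-time support functions described above may be ordered so that transfer latency is hidden.

    gpu_code_t code;                      /* hypothetical handle for injected code */
    gpu_data_t in, out;                   /* hypothetical handles for GPU data     */

    code = GPUinject(matmul_code);        /* code prefetch into GPU memory         */
    in   = GPUload(a_matrix, A_BYTES);    /* data prefetch into GPU memory         */

    independent_cpu_work();               /* CPU continues while transfers proceed */

    GPUlaunch(code, in, &out);            /* GPU executes the foreign
                                             macro-instruction                     */

    more_independent_cpu_work();          /* further CPU work overlaps GPU
                                             execution                             */

    GPUwait(code);                        /* join: stall only when the GPU results
                                             are actually needed                   */
    use_results(out);

    GPUrelease(code);                     /* housekeeping: de-allocate code area   */
    GPUfree(in);                          /* housekeeping: de-allocate data area   */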
While the runtime support function calls referred to above are presented as function calls, they are not treated by the compiler as general purpose library function calls. Instead, the compiler treats them as predefined CPU functions in terms of scheduling and optimizations that the compiler performs for these foreign operations. Thus,
For example, calls to GPUload( )/GPUfree( ) may be subject to load-store optimizations by the compiler. Also for example, whole program optimization techniques in combination with detection of common code sequences can be used by the compiler to eliminate GPUinject( )/GPUrelease( ) pairs.
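A simplified before/after sketch of the latter optimization is given below; the code handles and the fir_filter_code sequence are hypothetical. When whole program analysis detects that the same GPU code sequence is injected more than once, the intervening release/inject pair may be eliminated and the injected code reused.

    /* Before optimization: the same foreign code sequence is injected twice. */
    c1 = GPUinject(fir_filter_code);
    GPUlaunch(c1, block0, &r0);
    GPUwait(c1);
    GPUrelease(c1);
    c2 = GPUinject(fir_filter_code);      /* redundant re-injection */
    GPUlaunch(c2, block1, &r1);
    GPUwait(c2);
    GPUrelease(c2);

    /* After optimization: the GPUrelease( )/GPUinject( ) pair is eliminated
     * and the single injected copy is reused for both launches.             */
    c1 = GPUinject(fir_filter_code);
    GPUlaunch(c1, block0, &r0);
    GPUwait(c1);
    GPUlaunch(c1, block1, &r1);
    GPUwait(c1);
    GPUrelease(c1);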
Also, for example, the compiler may employ interleaving of load and launch function calls to achieve desired scheduling effects. For example, the compiler may interleave the load and launch function calls 816, 812, 813 of
Another scheduling-related optimization that may be performed by the compiler is to utilize any multithreading capability of the GPU. As is illustrated in
To summarize, the compiler 120 (
Such traditional compiler optimization techniques may include any techniques to help code run faster, use less memory, and/or use less power. Such optimizations may include loop, peephole, local, and/or inter-procedural (whole program) optimizations. For example, the compiler can employ compilation techniques that utilize loop optimizations, data-flow optimizations, or both, to effect efficient scheduling and code placement.
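For example only, the short C fragment below illustrates a conventional loop optimization (loop-invariant code motion) of the kind referred to above; the variable names are arbitrary.

    /* Before: the loop-invariant product scale * offset is recomputed on
     * every iteration.                                                    */
    for (int i = 0; i < n; i++)
        out[i] = in[i] + scale * offset;

    /* After loop-invariant code motion: the invariant expression is
     * computed once, before the loop.                                     */
    double t = scale * offset;
    for (int i = 0; i < n; i++)
        out[i] = in[i] + t;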
For at least one embodiment, the macro-instruction transport layer 904 may include a library 907 which includes GPU machine instructions to perform the required functionality to effectively inject the GPU code sequence (see, e.g., 820) corresponding to the macro-instruction 906 (see, e.g., 814 or 816) or load the data 909 into the GPU memory 230. The foreign macro-instruction transport layer library 907 may also provide the GPU machine language instructions for the functionality of the other run-time support functions such as “launch”, “release”, and “free” functions.
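A minimal sketch of how one such transport layer entry point might be organized is given below, assuming a hypothetical injection routine; the structure fields and the gpu_mem_alloc( ) and gpu_copy_async( ) helpers are illustrative only and do not correspond to any particular GPU's actual interface.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical low-level helpers assumed to be provided by the library 907. */
    extern uintptr_t gpu_mem_alloc(size_t bytes);
    extern void      gpu_copy_async(uintptr_t dst, const void *src, size_t bytes);

    /* Descriptor for a foreign GPU code sequence prepared by the compiler. */
    typedef struct {
        const void *host_image;    /* GPU machine code in CPU memory          */
        size_t      length;        /* size of the code sequence in bytes      */
        uintptr_t   gpu_address;   /* location of the code in GPU memory 230  */
    } foreign_code_desc;

    /* Hypothetical transport-layer routine underlying a GPUinject( ) call:
     * it places the foreign code sequence into GPU memory so that a later
     * launch can refer to it.                                               */
    int transport_inject(foreign_code_desc *desc)
    {
        desc->gpu_address = gpu_mem_alloc(desc->length);   /* reserve GPU memory */
        if (desc->gpu_address == 0)
            return -1;
        /* Copy across the CPU/GPU interconnect; the copy may be asynchronous
         * so that the CPU continues executing other instructions meanwhile.  */
        gpu_copy_async(desc->gpu_address, desc->host_image, desc->length);
        return 0;
    }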
The macro-instruction transport layer 904 may be invoked, for example, when the CPU 200 executes a GPUinject( ) function call. This invocation results in code prefetch into the GPU memory system 230; this system 230 may include an on-chip code cache (not shown). Such operation provides that the proper code (see, e.g., 820 of
For at least one embodiment, the foreign macro-instruction runtime system 906 runs on the GPU 220 to control execution of the various macro-instruction code injected by one or more CPU clients. The runtime may include a scheduler 914, which may apply its own caching and scheduling policies to effectively utilize the resources of the GPU 220 during execution of the foreign code sequence(s) 910.
Embodiments may be implemented in many different system types. Referring now to
Each processing element may include a single core or may, alternatively, include multiple cores. The processing elements may, optionally, include other on-die elements besides processing cores, such as integrated memory controller and/or integrated I/O control logic. Also, for at least one embodiment, the core(s) of the processing elements may be multithreaded in that they may include more than one hardware thread context per core.
The GMCH 520 may be a chipset, or a portion of a chipset. The GMCH 520 may communicate with the processor(s) 510, 515 and control interaction between the processing element(s) 510, 515 and memory 530. The GMCH 520 may also act as an accelerated bus interface between the processing element(s) 510, 515 and other elements of the system 500. For at least one embodiment, the GMCH 520 communicates with the processing element(s) 510, 515 via a multi-drop bus, such as a frontside bus (FSB) 595.
Furthermore, GMCH 520 is coupled to a display 540 (such as a flat panel display). GMCH 520 may include an integrated graphics accelerator. GMCH 520 is further coupled to an input/output (I/O) controller hub (ICH) 550, which may be used to couple various peripheral devices to system 500. Shown for example in the embodiment of
Alternatively, additional or different processing elements may also be present in the system 500. For example, additional processing element(s) 515 may include additional processor(s) that are the same as processor 510 and/or additional processor(s) that are heterogeneous or asymmetric to processor 510, such as accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the physical resources 510, 515 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 510, 515. For at least one embodiment, the various processing elements 510, 515 may reside in the same die package.
Referring now to
One or more of processing elements 670, 680 may be an element other than a CPU, such as a graphics processor, an accelerator or a field programmable gate array. For example, one of the processing elements 670 may be a single- or multi-core general purpose processor while another processing element 680 may be a single- or multi-core graphics accelerator, DSP, or co-processor.
While shown in
First processing element 670 may further include a memory controller hub (MCH) 672 and point-to-point (P-P) interfaces 676 and 678. Similarly, second processing element 680 may include an MCH 682 and P-P interfaces 686 and 688. As shown in
First processing element 670 and second processing element 680 may be coupled to a chipset 690 via P-P interconnects 676, 686 and 684, respectively. As shown in
In turn, chipset 690 may be coupled to a first bus 616 via an interface 696. In one embodiment, first bus 616 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in
Referring now to
For at least one embodiment, the CL 672, 682 may include memory controller hub logic (MCH) such as that described above in connection with
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs executing on programmable systems comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as code 630 illustrated in
Such machine-accessible storage media may include, without limitation, tangible arrangements of particles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
The programs may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The programs may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
Presented herein are embodiments of methods and systems for compiling code for a heterogeneous system that includes both one or more primary processors and one or more parallel co-processors. For at least one embodiment, the primary processor(s) include a CPU and the parallel co-processor(s) include a GPU. An optimizing compiler for the heterogeneous system comprehends the architecture of both processors, and generates an optimized fat binary that includes machine code instructions for both the primary processor(s) and the co-processor(s); the fat binary is generated without the aid of remote procedure calls for foreign code sequences (referred to herein as “macro-instructions”) to be executed on the GPU. The binary is the result of compiler optimization techniques, and includes prefetch instructions to load code and/or data into the GPU memory concurrently with execution of other instructions on the CPU. While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that numerous changes, variations and modifications can be made without departing from the scope of the appended claims. Accordingly, one of skill in the art will recognize that changes and modifications can be made without departing from the present invention in its broader aspects. The appended claims are to encompass within their scope all such changes, variations, and modifications that fall within the true scope and spirit of the present invention.