Computer systems often include one or more central processing units (CPUs) and one or more data parallel devices (e.g., graphics processing units (GPUs)). CPUs and data parallel devices typically operate using different instruction sets defined by their respective architectures such that CPU instructions may not be executable on data parallel devices and vice versa. CPUs generally perform all general purpose processing on computer systems, and data parallel devices generally perform data parallel processing (e.g., graphics processing) on computer systems.
Because of their different instruction sets and functions, CPUs and data parallel devices are often programmed using different high-level programming languages. For example, a CPU may be programmed using general purpose programming languages such as C or C++, and a data parallel device, such as a graphics processing unit (GPU), may be programmed using data parallel device programming languages, such as HLSL, GLSL, or Cg. Data parallel device programming languages, however, often have limitations that are not found in CPU programming languages. These limitations stem from the supporting role that data parallel devices have played to CPUs in executing programs on computer systems. As the role of data parallel devices increases due to enhancements in data parallel device processing capabilities, it would be desirable to enhance the ability of programmers to program data parallel devices.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
A compile environment is provided in a computer system that allows programmers to program both CPUs and data parallel devices (e.g., GPUs) using a high level general purpose programming language that has data parallel (DP) extensions. A compilation process translates modular DP code written in the general purpose language into DP device source code in a high level DP device programming language using a set of binding descriptors. A binder generates a single, self-contained DP device source code unit from the set of binding descriptors. A DP device compiler generates a DP device executable for execution on one or more data parallel devices from the DP device source code unit.
The accompanying drawings are included to provide a further understanding of embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain principles of embodiments. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
In the following Detailed Description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. In this regard, directional terminology, such as “top,” “bottom,” “front,” “back,” “leading,” “trailing,” etc., is used with reference to the orientation of the Figure(s) being described. Because components of embodiments can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration and is in no way limiting. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims. It is to be understood that the features of the various exemplary embodiments described herein may be combined with each other, unless specifically noted otherwise.
GP executable 32 represents a program intended for execution on one or more processors (e.g., central processing units (CPUs)). GP executable 32 includes low level instructions from an instruction set of one or more central processing units (CPUs). GP executable 32 may also include one or more DP device executables 40. A DP device executable 40 represents a data parallel program (e.g., a shader) intended for execution on one or more data parallel (DP) devices such as DP device 210 shown in
GP code 12 includes a sequence of instructions of a high level general purpose programming language with data parallel extensions (hereafter GP language) that form a program stored in a set of one or more modules. The GP language allows the program to be written in different parts (i.e., modules) such that each module may be stored in separate files or locations accessible by the computer system. The GP language provides a single language for programming a computing environment that includes one or more general purpose CPUs and one or more special purpose DP devices. Using the GP language, a programmer may include both CPU and DP device code in GP code 12 for execution by CPUs and DP devices, respectively, and coordinate the execution of the CPU and DP device code. GP code 12 may represent any suitable type of code, such as an application, a library function, or an operating system service.
In one embodiment, the GP language may be formed by extending a widely adopted, high level, and general purpose programming language such as C or C++ to include data parallel features. The GP language includes rich linking capabilities that allow different parts of a program to be included in different modules as shown in
GP code 12 includes one or more portions 14 in one or more modules with code designated for execution on a DP device. In one embodiment, the GP language allows a programmer to designate a portion 14 of GP code 12 as DP device code using an annotation 16 (e.g., __declspec(vector) . . . ) when defining a kernel function (also referred to as a vector function). The annotation 16 is associated with a function name 17 (e.g., kernel_func) of the kernel function that is intended for execution on a DP device. Code portions 14 may also include one or more invocations 18 of the kernel function (e.g., forall . . . , kernel_func, . . . ). The kernel function may call other kernel functions in GP code 12 (i.e., other DP device code) and may use types (e.g., classes or structs) defined by GP code 12. The types may or may not be annotated as DP device code. In other embodiments, other suitable programming language constructs may be used to designate portions 14 of GP code 12 as DP device code and/or CPU code.
Compile environment 10 includes a GP compiler 20 and a linker 30. GP compiler 20 is configured to compile GP code 12, where GP code 12 is written in a GP language, stored in one or more modules, and includes both CPU code and DP device code. GP compiler 20 may be formed by extending the compiler functionality of a widely adopted, high level, and general purpose programming language compiler, such as a C or C++ compiler, to have the ability to compile both CPU code and DP device code in GP code 12.
For CPU code in GP code 12, GP compiler 20 compiles the one or more modules with CPU code into one or more object or intermediate representation (IR) files 22 with symbols that identify the relationships between the one or more object or IR files 22. Linker 30 receives the object or IR files 22, combines them into a GP executable 32, and resolves the symbols between the one or more object or IR files 22. GP executable 32 includes low level instructions from an instruction set defined by a CPU. Accordingly, GP executable 32 is directly executable by one or more CPUs that implement the instruction set.
For DP device code in portions 14 of GP code 12, GP compiler 20 and linker 30 combine to generate a single, self-contained DP device source code unit 36 (e.g., a file or a string) in a high level data parallel (DP) device language for each invocation 18 in each portion 14 of GP code 12. Linker 30 provides each DP device source code unit 36 to a DP device compiler 38. DP device compiler 38 is configured to compile code written in a high level DP device programming language such as HLSL (High Level Shader Language) rather than code written in the GP language of GP code 12. In one embodiment, GP compiler 20 translates portions 14 from the GP language into the high level DP device programming language for later inclusion in DP device source code unit 36 by a binder 34 in linker 30. In another embodiment, GP compiler 20 translates portions 14 from the GP language into an intermediate representation (IR) and binder 34 translates the IR into the high level DP device programming language for inclusion in DP device source code unit 36.
In addition, DP device compiler 38 includes limited or no linking capability. To operate with this single module mode of DP device compiler 38, GP compiler 20 and linker 30 generate the DP device source code unit 36 for each invocation 18 to be fully self-contained—i.e., include all DP device source code for kernel functions and types that stem from a corresponding invocation 18 in a portion 14 of GP code 12.
In particular, GP compiler 20 separately translates each invocation 18, kernel function, and type into DP intermediate code (i.e., DP device source code or IR) in a set of binding descriptors 24 along with other binding information. Linker 30 includes binder 34 that binds the DP intermediate code from the set of binding descriptors 24 into a DP device source code unit 36 by traversing the call graph rooted from an invocation 18 and formed by the set of binding descriptors 24, translating DP intermediate code into DP device source code (if necessary), and concatenating the DP device source code from the set of binding descriptors 24. The functions of binder 34 may be performed by binder 34 statically if all needed DP intermediate code is available or dynamically at runtime. DP device compiler 38 compiles each DP device source code unit 36 with high level instructions from the high level DP device language into a corresponding DP device executable 40 with byte code or low level instructions from an instruction set of a DP device that is intended for execution on a DP device.
Although shown separately from GP compiler 20 and linker 30 in the embodiment of
GP compiler 20 uses a naming convention for kernel functions and types used in the DP intermediate code. The naming convention ensures that a unique name is used for each kernel function and type and that the unique name is used consistently for each instance of a function and a type. In addition, GP compiler 20 uses a naming convention for names used for identifying binding descriptors 24. This naming convention allows binding descriptors 24 to be uniformly referenced in import tables 24D based on locally available information. The naming conventions may be based on the names of the kernel functions and types in GP code 12.
Additional details of the process of compiling one or more DP device code portions 14 in GP code 12 into a DP device executable 40 will now be described with reference to
In the embodiment described with reference to
GP compiler 20 performs the method of
For an invocation 18, GP compiler 20 translates the DP code of the invocation 18 from the GP language into DP intermediate code that is used to setup the call to the invoked kernel function. GP compiler 20 stores this DP intermediate code into DP intermediate code 24C in an invocation stub binding descriptor 24 for the invocation site 18 along with references to the declaration and definition binding descriptors 24 of the invoked kernel function and references to the declaration binding descriptors 24 of any types used by the invocation site in import table 24D.
For a kernel function, GP compiler 20 generates a declaration binding descriptor 24 and a definition binding descriptor 24. GP compiler 20 generates a declaration binding descriptor 24 that includes the DP intermediate code for declaring the kernel function in DP intermediate code 24C and references to the declaration binding descriptors 24 of any types used in the declaration of the kernel function in import table 24D. GP compiler 20 also generates a definition binding descriptor 24 that includes the DP intermediate code for defining the kernel function in DP intermediate code 24C, references to declaration and definition binding descriptors 24 of any called kernel functions in import table 24D, references to the declaration binding descriptors 24 of any types used by the kernel function in import table 24D, and references to the definition binding descriptors 24 of any member functions used by the kernel function in import table 24D.
In response to GP compiler 20 being invoked to compile module B.cpp in the example of
GP compiler 20 also identifies kernel function Hoo 16(2) in module B.cpp. GP compiler 20 generates a declaration binding descriptor 24(4) and a definition binding descriptor 24(5) in performing the function of block 52 of
Thus, for module B.cpp, GP compiler 20 generates binding descriptors 24(2) and 24(3) for kernel function Foo and binding descriptors 24(4) and 24(5) for kernel function Hoo in B.cpp.
In response to GP compiler 20 being invoked to compile module C.cpp in the example of
GP compiler 20 also identifies kernel function Hoo 16(4) in module C.cpp. GP compiler 20 generates a declaration binding descriptor 24(8) and a definition binding descriptor 24(9) in performing the function of block 52 of
Thus, for module C.cpp, GP compiler 20 generates binding descriptors 24(6) and 24(7) for kernel function Boo and binding descriptors 24(8) and 24(9) for kernel function Hoo in C.cpp.
Referring back to
With reference to the example of
The functions performed by binder 34 in one embodiment will now be described with reference to
In the example of
The above embodiments may close a gap between general purpose languages with rich linking capabilities and DP device languages with little or no linking capabilities. The above embodiments may do so while maintaining a current toolchain flow of a general purpose language and allowing programmers to program both CPUs and data parallel devices together in a modular and componentized way.
Computer system 100 includes one or more processor packages 102, a memory system 104, zero or more input/output devices 106, zero or more display devices 108, zero or more peripheral devices 110, and zero or more network devices 112. Processor packages 102, memory system 104, input/output devices 106, display devices 108, peripheral devices 110, and network devices 112 communicate using a set of interconnections 114 that includes any suitable type, number, and configuration of controllers, buses, interfaces, and/or other wired or wireless connections.
Computer system 100 represents any suitable processing device configured for a general purpose or a specific purpose. Examples of computer system 100 include a server, a personal computer, a laptop computer, a tablet computer, a personal digital assistant (PDA), a mobile telephone, a smart phone, and an audio/video device. The components of computer system 100 (i.e., processor packages 102, memory system 104, input/output devices 106, display devices 108, peripheral devices 110, network devices 112, and interconnections 114) may be contained in a common housing (not shown) or in any suitable number of separate housings (not shown).
Processor packages 102 each include one or more processing cores that form execution hardware configured to execute instructions (i.e., software). Each processor package 102 may include processing cores with the same or different architectures and/or instruction sets. For example, the processing cores may include any combination of in-order execution cores, superscalar execution cores, and data parallel execution cores (e.g., GPU execution cores). Each processing core is configured to access and execute instructions stored in memory system 104. The instructions may include a basic input output system (BIOS) or firmware (not shown), an operating system (OS) 122, GP code 12, GP compiler 20, linker 30 with binder 34, DP device compiler 38, and GP executable 32 with DP device executable 40. Each processing core may execute the instructions in conjunction with or in response to information received from input/output devices 106, display devices 108, peripheral devices 110, and/or network devices 112.
Computer system 100 boots and executes OS 122. OS 122 includes instructions executable by the processing cores to manage the components of computer system 100 and provide a set of functions that allow programs to access and use the components. In one embodiment, OS 122 is the Windows operating system. In other embodiments, OS 122 is another operating system suitable for use with computer system 100. Computer system 100 executes GP compiler 20, linker 30, binder 34, and DP device compiler 38 to generate GP executable 32 with DP device executable 40 from GP code 12 as described above. Computer system 100 may execute GP executable 32, including DP device executable 40, using one or more processing cores as described with reference to the embodiment of
Memory system 104 includes any suitable type, number, and configuration of volatile or non-volatile storage devices configured to store instructions and data. The storage devices of memory system 104 represent computer readable storage media that store computer-executable instructions (i.e., software) including OS 122, GP code 12, GP compiler 20, linker 30, binder 34, DP device compiler 38, and GP executable 32 with DP device executable 40. The instructions are executable by computer system 100 to perform the functions and methods of OS 122, GP code 12, GP compiler 20, linker 30, binder 34, DP device compiler 38, GP executable 32, and DP device executable 40 as described herein. Memory system 104 stores instructions and data received from processor packages 102, input/output devices 106, display devices 108, peripheral devices 110, and network devices 112. Memory system 104 provides stored instructions and data to processor packages 102, input/output devices 106, display devices 108, peripheral devices 110, and network devices 112. Examples of storage devices in memory system 104 include hard disk drives, random access memory (RAM), read only memory (ROM), flash memory drives and cards, and magnetic and optical disks such as CDs and DVDs.
Input/output devices 106 include any suitable type, number, and configuration of input/output devices configured to input instructions or data from a user to computer system 100 and output instructions or data from computer system 100 to the user. Examples of input/output devices 106 include a keyboard, a mouse, a touchpad, a touchscreen, buttons, dials, knobs, and switches.
Display devices 108 include any suitable type, number, and configuration of display devices configured to output textual and/or graphical information to a user of computer system 100. Examples of display devices 108 include a monitor, a display screen, and a projector.
Peripheral devices 110 include any suitable type, number, and configuration of peripheral devices configured to operate with one or more other components in computer system 100 to perform general or specific processing functions.
Network devices 112 include any suitable type, number, and configuration of network devices configured to allow computer system 100 to communicate across one or more networks (not shown). Network devices 112 may operate according to any suitable networking protocol and/or configuration to allow information to be transmitted by computer system 100 to a network or received by computer system 100 from a network.
In one embodiment, DP device 210 represents a graphics card where one or more graphics processing units (GPUs) include PEs 212 and a memory 214 that is separate from memory 104 (
In another embodiment, DP device 210 is formed from the combination of one or more GPUs (i.e., PEs 212) that are included in processor packages 102 (
In a further embodiment, DP device 210 is formed from the combination of one or more vector processing pipelines in one or more of the execution cores of processor packages 102 (
In yet another embodiment, DP device 210 is formed from the combination of one or more scalar processing pipelines in one or more of the execution cores of processor packages 102 (
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof.
Number | Date | Country
---|---|---
20110314458 A1 | Dec 2011 | US