The invention relates to the compilation and execution of code. More specifically, the invention relates to accessing of thread-privatized global storage objects during such compilation and execution.
Parallel computing of tasks achieves faster execution and/or enables the performance of complex tasks that single-process systems cannot perform. One paradigm for performing parallel computing is shared-memory programming. The OpenMP standard is an agreed-upon industry standard for programming shared memory architectures in a multi-threaded environment.
In a multi-threaded environment, privatization of global storage objects that can be accessed by a number of computer programs and/or threads is a technique that allows for parallel processing of such computer programs and thereby allows for enhancement in the speed and performance of these programs. In particular, privatization refers to a process of providing individual copies of global storage objects in a global memory address space for multiple processors or threads of execution.
One current approach to privatization can be implemented via a hardware partitioning of a computer system's physical address space into shared and private regions. In addition to the limitation of being hardware-specific, this approach suffers from limits on the size of private storage areas, from difficulties in efficiently utilizing fixed-size global and private storage areas, and from difficulties in managing ownership of various storage areas in a multiprocessing or multiprogramming environment.
Embodiments of the invention may be best understood by referring to the following description and accompanying drawings that illustrate such embodiments. The numbering scheme for the Figures included herein is such that the leading number for a given element in a Figure is associated with the number of the Figure. For example, system 100 can be located in
In the drawings:
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention may be practiced without these specific details.
Embodiments of the present invention are portable to different operating systems, hardware architectures, parallel programming paradigms, programming languages, compilers, linkers, run time environments and multi-threading environments. Moreover, embodiments of the present invention allow portions of what was executed during run time of the user's code to be moved to compile time and prior thereto. In particular, as will be described, embodiments of the present invention enable the exporting of a copy of a data structure that was internal to a run time library into the program units of the code (e.g., source code), thereby increasing the run time speed and performance of the code. A copy of the data structure is loaded into the software cache through a single access to a routine in the run time library such that subsequent accesses by the threads to their thread private variables are to the software cache and not to the run time library.
As illustrated in
Chipset 120 for one embodiment comprises memory controller hub (MCH) 130, input/output (I/O) controller hub (ICH) 140, and firmware hub (FWH) 170. MCH 130, ICH 140, and FWH 170 may each comprise any suitable circuitry and for one embodiment are each formed as a separate integrated circuit chip. Chipset 120 for other embodiments may comprise any suitable one or more integrated circuit devices.
MCH 130 may comprise any suitable interface controllers to provide for any suitable communication link to processor bus 110 and/or to any suitable device or component in communication with MCH 130. MCH 130 for one embodiment provides suitable arbitration, buffering, and coherency management for each interface.
MCH 130 is coupled to processor bus 110 and provides an interface to processors 102 and 104 over processor bus 110. Processor 102 and/or processor 104 may alternatively be combined with MCH 130 to form a single chip. MCH 130 for one embodiment also provides an interface to a main memory 132 and a graphics controller 134 each coupled to MCH 130. Main memory 132 stores data and/or instructions, for example, for computer system 100 and may comprise any suitable memory, such as a dynamic random access memory (DRAM) for example. Graphics controller 134 controls the display of information on a suitable display 136, such as a cathode ray tube (CRT) or liquid crystal display (LCD) for example, coupled to graphics controller 134. MCH 130 for one embodiment interfaces with graphics controller 134 through an accelerated graphics port (AGP). Graphics controller 134 for one embodiment may alternatively be combined with MCH 130 to form a single chip.
MCH 130 is also coupled to ICH 140 to provide access to ICH 140 through a hub interface. ICH 140 provides an interface to I/O devices or peripheral components for computer system 100. ICH 140 may comprise any suitable interface controllers to provide for any suitable communication link to MCH 130 and/or to any suitable device or component in communication with ICH 140. ICH 140 for one embodiment provides suitable arbitration and buffering for each interface.
For one embodiment, ICH 140 provides an interface to one or more suitable integrated drive electronics (IDE) drives 142, such as a hard disk drive (HDD) or compact disc read only memory (CD ROM) drive for example, to store data and/or instructions for example, one or more suitable universal serial bus (USB) devices through one or more USB ports 144, an audio coder/decoder (codec) 146, and a modem codec 148. ICH 140 for one embodiment also provides an interface through a super I/O controller 150 to a keyboard 151, a mouse 152, one or more suitable devices, such as a printer for example, through one or more parallel ports 153, one or more suitable devices through one or more serial ports 154, and a floppy disk drive 155. ICH 140 for one embodiment further provides an interface to one or more suitable peripheral component interconnect (PCI) devices coupled to ICH 140 through one or more PCI slots 162 on a PCI bus and an interface to one or more suitable industry standard architecture (ISA) devices coupled to ICH 140 by the PCI bus through an ISA bridge 164. ISA bridge 164 interfaces with one or more ISA devices through one or more ISA slots 166 on an ISA bus.
ICH 140 is also coupled to FWH 170 to provide an interface to FWH 170. FWH 170 may comprise any suitable interface controller to provide for any suitable communication link to ICH 140. FWH 170 for one embodiment may share at least a portion of the interface between ICH 140 and super I/O controller 150. FWH 170 comprises a basic input/output system (BIOS) memory 172 to store suitable system and/or video BIOS software. BIOS memory 172 may comprise any suitable non-volatile memory, such as a flash memory for example.
Additionally, computer system 100 includes translation unit 180, compiler unit 182 and linker unit 184. In an embodiment, translation unit 180, compiler unit 182 and linker unit 184 can be processes or tasks that can reside within main memory 132 and/or processors 102 and 104 and can be executed within processors 102 and 104. However, embodiments of the present invention are not so limited, as translation unit 180, compiler unit 182 and linker unit 184 can be different types of hardware (such as digital logic) executing the processing described therein (which is described in more detail below).
Accordingly, computer system 100 includes a machine-readable medium on which is stored a set of instructions (i.e., software) embodying any one, or all, of the methodologies described above. For example, software can reside, completely or at least partially, within main memory 132 and/or within processors 102/104. For the purposes of this specification, the term “machine-readable medium” shall be taken to include any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); etc.
Additionally, program unit(s) 202 can include one to a number of global storage objects. In an embodiment, global storage objects are storage locations that are addressable across a number of program units. Examples of such objects can include simple (scalar) global variables and compound (aggregate) global objects such as structs, unions and classes in C and C++ and COMMON blocks and STRUCTUREs in Fortran.
In an embodiment, translation unit 180 performs a source-to-source code level transformation of program unit(s) 202 to generate translated program unit(s) 204. However, embodiments of the present invention are not so limited. For example, in another embodiment, translation unit 180 could perform a source-to-assembly code level transformation of program unit(s) 202. In an alternative embodiment, translation unit 180 could perform an assembly-to-source code level transformation of program unit(s) 202. This transformation of program unit(s) 202 is described in more detail below in conjunction with the flow diagrams illustrated in
Compiler unit 182 receives translated program units 204 and generates object code 208. Compiler unit 182 can be different compilers for different operating systems and/or different hardware. For example, in an embodiment, compiler unit 182 can generate object code 208 to be executed on different types of Intel® processors. Moreover, in an embodiment, the compilation of translated program unit(s) 204 is based on the OpenMP industry standard.
Linker unit 184 receives object code 208 and runtime library 206 and generates executable code 210. Runtime library 206 can include one to a number of different functions or routines that are incorporated into translated program unit(s) 204. Examples of such functions or routines could include, but are not limited to, a threadprivate support function (which is discussed in more detail below), functions for the creation and management of thread teams, functions for lock synchronization and barrier scheduling support, and query functions for thread team size or thread identification. In one embodiment, executable code 210 that is output from linker unit 184 can be executed in a multi-processor shared memory environment. Additionally, executable code 210 can be executed across a number of different operating system platforms, including, but not limited to, different versions of UNIX, Microsoft Windows™, and real time operating systems such as VxWorks™, etc.
The operation of translation unit 180 will now be described in conjunction with the flow diagram of
In contrast, upon determining that there are remaining program unit(s) 202 to be translated, translation unit 180 determines whether there are any remaining global storage objects to be privatized within the current program unit 202 being translated, at process decision block 304. In an embodiment, this determination is made based on the declaration of the objects within the program unit(s) 202 (i.e., the objects being defined as “thread private”).
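By way of illustration only, one way a global storage object might be declared as thread private in C is through the OpenMP `threadprivate` directive. The following is a minimal sketch under that assumption; the names `call_count` and `record_calls` are hypothetical and do not appear in the program units described herein.

```c
#include <assert.h>

/* Global storage object, addressable from any program unit. */
int call_count = 0;

/* OpenMP directive requesting one private copy of call_count per thread.
   If OpenMP support is not enabled at compile time, the pragma is ignored
   and a single shared copy remains. */
#pragma omp threadprivate(call_count)

/* Hypothetical routine: adds n to the calling thread's copy and returns
   the new value of that copy. */
int record_calls(int n)
{
    assert(n > 0);
    call_count += n;
    return call_count;
}
```

In such a sketch, a declaration of this kind is what a translation unit could look for when deciding whether an object remains to be privatized.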
Returning to
The incorporation of initialization logic to enable accessing of the thread private variables into the applicable program units will now be described. In particular,
In an embodiment, the cache object is stored within the software cache. To help illustrate the cache objects,
As shown, memory 714 includes thread private variables 704A-C and thread private variables 708A-C. Thread private variables 704A-C and thread private variables 708A-C are storage locations for private copies of global storage objects that have been designated to include a private copy for each thread that accesses such objects (as described above in conjunction with
Further, memory 714 includes cache object 702 and cache object 706. In an embodiment, the addresses of cache objects 702 and 706 are in a fixed location with respect to the source code being translated by translation unit 180. For example, the beginning of the source code and associated data could be at 0x50, and cache object 702 could be stored at 0x100 while cache object 706 could be stored at 0x150. While cache objects 702 and 706 can be different types of data structures for the storage of pointers, in one embodiment, cache objects 702 and 706 are arrays of pointers.
As shown, cache object 702 includes pointers 710A-710C, which could be one to a number of pointers. Moreover, each of pointers 710A-710C points to one of thread private variables 704A-C. In particular, pointer 710A points to thread private variable 704A, pointer 710B points to thread private variable 704B and pointer 710C points to thread private variable 704C. Cache object 706 includes pointers 712A-712C, which could be one to a number of pointers. Moreover, each of pointers 712A-712C points to one of thread private variables 708A-C. In particular, pointer 712A points to thread private variable 708A, pointer 712B points to thread private variable 708B and pointer 712C points to thread private variable 708C.
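The relationship between a cache object and its thread private variables can be sketched in C as an array of pointers indexed by thread identification. This is an illustrative sketch only: the team size `MAX_THREADS`, the `int` element type and the function names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 3 /* hypothetical team size */

/* Private copies of one global storage object, one per thread
   (compare thread private variables 704A-C). */
static int private_copies[MAX_THREADS];

/* Cache object: an array of pointers indexed by thread identification
   (compare pointers 710A-710C). */
static int *cache_object[MAX_THREADS];

/* Point each slot of the cache object at that thread's private copy. */
static void fill_cache_object(void)
{
    for (size_t tid = 0; tid < MAX_THREADS; ++tid)
        cache_object[tid] = &private_copies[tid];
}

/* Return the private copy belonging to the thread identified by tid. */
static int *private_copy_for(size_t tid)
{
    assert(tid < MAX_THREADS);
    return cache_object[tid];
}
```

A write through `private_copy_for(0)` touches only the first private copy, leaving the copies of the other threads unchanged.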
Returning to process decision block 502 of
In contrast, upon determining that the cache object for this global storage object has been created/generated, the initialization logic sets a variable assigned to the pointer (hereinafter “the thread private pointer variable”) to the value of the pointer for this particular thread based on the identification of the thread, at process block 506. With regard to code segment 600 of
In particular, the identification of the thread is employed to index into the cache object to locate the value of the pointer. For example, if the number of threads to execute the program unit(s) 204 equals five and the cache object is an array of pointers, the pointer for the thread having an identification of two would be the third value in the array (using zero-based indexing). Accordingly, the initialization logic can determine whether the pointer located at the particular index in the cache object is set. Returning to
Additionally, the initialization logic (illustrated by method 500) determines whether the thread private pointer variable for this particular thread is a non-zero value, at process decision block 508 (as illustrated by the “if” statement in code segment 610 of
Upon determining that the address for the cache object is zero, this run time library routine allocates the cache object at the fixed address for the cache object. Additionally, the run time library routine creates/generates the thread private variable and stores the address of this variable into the appropriate location within the cache object. For example, if the cache object were an array of pointers wherein the index into this array is defined by the identification of the thread, the appropriate location would be based on this thread identification. Upon determining that the address for the cache object is non-zero, this run time library routine does not reallocate the cache object. Rather, the run time library routine creates/generates the thread private variable and stores the address of this variable into the appropriate location within the cache object. In one embodiment, the addresses of the thread private pointer variable and the cache object are returned through the parameters of the run time library routine. In another embodiment, only the address of the cache object is returned through the parameters of the run time library routine, as the address of the thread private pointer variable is stored within the cache object (thereby reducing the amount of data returned by the run time library routine). Accordingly, the initialization logic receives the address of the thread private pointer variable and the pointer to the cache object, at process block 512. Method 500 is complete at process block 514.
Upon determining that the thread private pointer variable for this particular thread is a non-zero value (thereby indicating that the cache object has been created/generated and the thread private pointer variable has been assigned to the memory location of the thread private variable), the initialization logic is complete at process block 514. Therefore, as described above in conjunction with process block 308 of
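The demand-driven allocation performed by the run time library routine can be sketched as follows. The routine name `rtl_threadprivate_get`, its signature, and the use of `calloc`/`malloc` are assumptions made for this example; the sketch is single-threaded and omits any serialization that an actual routine would need while the cache object is being created.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical run time library routine (name and signature assumed).
   *cache is the address of the cache object, NULL until first use.
   Returns the address of the private copy for thread tid, or NULL if
   memory cannot be obtained. */
static void *rtl_threadprivate_get(void ***cache, size_t tid, size_t nthreads,
                                   const void *initial, size_t size)
{
    assert(tid < nthreads);
    if (*cache == NULL) {
        /* Address is zero: allocate the cache object, all slots zeroed
           (assumes an all-zero bit pattern represents a null pointer). */
        *cache = calloc(nthreads, sizeof **cache);
        if (*cache == NULL)
            return NULL;
    }
    if ((*cache)[tid] == NULL) {
        /* Create this thread's private variable and record its address. */
        void *copy = malloc(size);
        if (copy == NULL)
            return NULL;
        memcpy(copy, initial, size);
        (*cache)[tid] = copy;
    }
    return (*cache)[tid];
}
```

In this sketch the cache object is allocated only by the first call; later calls for other threads reuse it and fill in only their own slot.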
Accordingly, embodiments of the present invention export a copy of a data structure that was internal to the run time library into the program units of the code, thereby increasing the run time speed and performance of the code. In particular, a copy of the data structure is loaded into the software cache through a single access to a routine in the run time library such that subsequent accesses by a thread to its thread private variable are to the software cache and not to the run time library. Additionally, as illustrated, initialization logic is in-lined within the program unit(s) for the global storage objects to reduce the number of accesses to the run time library. As shown, translation unit 180 has introduced initialization logic that moves the accessing of the thread private variables of global storage objects from run time to compile time, as the introduction of such logic enables the compiler to determine what data needs to be stored as well as the storage location of such data. Moreover, the allocation of a cache object for a given global storage object is demand driven, such that the first thread allocates the cache object, and subsequent accesses to thread private variables by other threads executing the program units within the code are made through this single cache object.
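The effect of the in-lined initialization logic can be sketched in C as a check of the cache object that falls back to the run time library only once per thread. The names `library_lookup`, `increment_counter` and `library_calls` are hypothetical, and `library_lookup` is merely a stand-in for a run time library routine, not the routine itself.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_THREADS 4 /* hypothetical team size */

static int counter_private[NUM_THREADS]; /* storage handed out by the stand-in library */
static int *counter_cache[NUM_THREADS];  /* cache object visible to the program unit */
static int library_calls;                /* counts accesses to the stand-in library */

/* Stand-in for the run time library routine: fills the slot for tid. */
static int *library_lookup(size_t tid)
{
    ++library_calls;
    counter_cache[tid] = &counter_private[tid];
    return counter_cache[tid];
}

/* Program unit code with in-lined initialization logic, followed by an
   access to the thread private variable. */
static int increment_counter(size_t tid)
{
    assert(tid < NUM_THREADS);
    int *p = counter_cache[tid]; /* thread private pointer variable */
    if (p == NULL)               /* zero: not yet initialized for this thread */
        p = library_lookup(tid);
    return ++*p;
}
```

Calling `increment_counter` three times for the same thread in this sketch reaches the stand-in library once; the remaining two accesses are served from the cache object.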
Further, embodiments of the present invention exploit the monotonic characteristic of addresses of the cache object and the thread private variables. In particular, such addresses are initialized to a zero or NULL value and are written once to transition to the final allocated value. Embodiments of the present invention also exploit the coherent nature of a shared memory system, such that a pointer can be in one of two states (either in the original state or the modified state). Embodiments of the present invention also allow for a lock-free design after creation of the cache object in a coherent memory parallel processing environment.
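The write-once transition from NULL to a final address can be illustrated with a C11 atomic compare-and-exchange. This is a swapped-in technique chosen for the example: the description above relies on the coherent nature of the shared memory system, and does not state that atomic operations are used. The names `slot` and `publish_once` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A pointer slot that starts as NULL (static storage) and is written once. */
static _Atomic(int *) slot;

/* Try to move the slot from NULL to candidate. Returns the address that
   the slot finally holds: candidate if this call performed the single
   write, otherwise the address stored by an earlier call. */
static int *publish_once(int *candidate)
{
    int *expected = NULL;
    assert(candidate != NULL);
    if (atomic_compare_exchange_strong(&slot, &expected, candidate))
        return candidate;
    return expected; /* slot was already in its modified state */
}
```

Because the slot only ever moves from the original state to the modified state, a reader in this sketch that observes a non-NULL value can use it without taking a lock.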
Thus, a method and apparatus for accessing thread privatized global storage objects have been described. Although the present invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
The present patent application is a Divisional of application Ser. No. 09/966,518, filed Sep. 28, 2001, entitled “Method and Apparatus for Accessing Thread-Privatized Global Storage Objects”.
Relation | Number | Date | Country
---|---|---|---
Parent | 09966518 | Sep 2001 | US
Child | 11437352 | May 2006 | US