This application is directed, in general, to computing systems and, more specifically, to a system and method for translating program functions for correct handling of local-scope variables and a computing system incorporating the same.
As those skilled in the art are aware, software programs make use of variables, for which storage space must be allocated in memory. Often the storage space is allocated in the form of a stack.
Some programs organize their variables into levels to simplify their data structure. For example, some variables, known as local-scope variables, may be used only in one function (function-scope) or one block of statements (block-scope) in the program. Other variables, known as global-scope variables, may be used by the entire program. For example, Table 1, below, sets forth pseudocode illustrating two levels of local-scope variables: an array a[ ] being a function-scope variable, and array b[ ] being a block-scope variable.
For purposes of understanding Table 1 and the remainder of this disclosure, the terms “foo” and “bar” are arbitrary names of functions. Any function can therefore be substituted for “foo” or “bar.”
Some programs are parallel programs, in that they are capable of being executed in parallel in a parallel processor, such as a graphics processing unit (GPU). Parallel programs have sequential regions of code that cannot be executed in parallel and parallel regions of code that can be executed in parallel, in threads. In parallel programs, multiple threads need to gain access to some local-scope variables as defined by the programming model according to which the program was developed. OpenMP and OpenACC are two examples of conventional programming models for developing parallel programs. Tables 2 and 3, below, set forth pseudocode illustrating respective OpenMP and OpenACC examples.
One aspect provides a system for translating functions of a program. In one embodiment, the system includes: (1) a local-scope variable identifier operable to identify local-scope variables employed in at least some of the functions as being either thread-shared local-scope variables or thread-private local-scope variables and (2) a function translator associated with the local-scope variable identifier and operable to translate the at least some of the functions to cause thread-shared memory to be employed to store the thread-shared local-scope variables and thread-private memory to be employed to store the thread-private local-scope variables.
Another aspect provides a method of translating functions of a program. In one embodiment, the method includes: (1) identifying at least some of the functions as either being executed during sequential regions of the program or being executed during parallel regions of the program, (2) identifying local-scope variables employed in the at least some of the functions as being either thread-shared local-scope variables or thread-private local-scope variables and (3) allocating thread-private memory for storage of the thread-private local-scope variables and thread-shared local-scope variables employed in functions executed during the parallel regions of the program, and thread-shared memory for storage of thread-shared local-scope variables employed in functions executed during the sequential regions of the program.
In another embodiment, the method includes: (1) creating corresponding shared clones and private clones of at least some of the functions and (2) invoking the shared clones in lieu of the corresponding functions during execution of sequential regions of the program and the private clones in lieu of the corresponding functions during execution of parallel regions of the program.
Reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Some computer architectures, including those centered about general-purpose central processing units (CPUs), typically provide a memory structure that is, by default, shared over the entirety of a given program such that all blocks, and all functions in the blocks, in the program have access to all of the variables. Allocating storage space in such architectures is straightforward; it is allocated in the shared memory structure.
However, certain computing systems, such as GPUs, employ a stratified memory architecture in which some memory is thread-private (accessible only by a given thread), and other memory is thread-shared (accessible by multiple threads, e.g., in a given block or the entirety of a program).
It is realized that the local-scope variables defined in some programs are exclusively used in one thread, but that others need to be shared between threads. It is further realized, however, that programs in many programming models (OpenMP or OpenACC for example) do not draw a distinction between thread-private and thread-shared local-scope variables.
A conservative but trivial approach to a solution might be to allocate storage space in the thread-shared memory for all of the local-scope variables. However, it is realized herein that the thread-shared memory is often too small to contain all local-scope variables. An alternative approach might be to allocate storage space in the thread-private memory for all of the local-scope variables. However, the local-scope variables could not then be shared in a straightforward manner between threads. Instead, local-scope variables would have to be copied between thread-private memories to effect sharing, which is clumsy and slows program execution. Yet another approach might be to rewrite the program body to incorporate called functions into the functions calling them, a technique called inlining. However, inlining obscures the program's structure, making the program difficult to understand and modify. Still another approach might be to analyze the program manually, designating local-scope variables for storage in either thread-private or thread-shared memory by hand. However, this approach is labor-intensive and error-prone.
It is realized herein that an automated mechanism is needed for translating the functions of a program such that local-scope variables requiring thread-shared memory (thread-shared local-scope variables) are identified. Once identified, the local-scope variables may then be handled correctly. More particularly, storage space in thread-shared memory can be allocated for the thread-shared local-scope variables, and storage space in thread-local memory can be allocated for the local-scope variables not requiring shared memory (thread-private local-scope variables).
Accordingly, introduced herein are various embodiments of a system and method for translating functions for correct handling of local-scope variables in thread-shared and private contexts. In general, the system and method embodiments identify local-scope variables as being either thread-shared local-scope variables or thread-private local-scope variables and allocate thread-private or thread-shared memory to store the local-scope variables based on their identification and the region in which the function employing them is being executed: either in a sequential region or a parallel region.
Before describing various embodiments of the novel system and method in greater detail, a representative computing system will now be described.
Computing system 100 further includes a pipeline control unit 108, shared memory 110 and an array of local memory 112-1 through 112-J associated with thread groups 104-1 through 104-J. Pipeline control unit 108 distributes tasks to the various thread groups 104-1 through 104-J over a data bus 114. Pipeline control unit 108 creates, manages, schedules, executes and provides a mechanism to synchronize thread groups 104-1 through 104-J.
Cores 106 within a thread group execute in parallel with each other. Thread groups 104-1 through 104-J communicate with shared memory 110 over a memory bus 116. Thread groups 104-1 through 104-J respectively communicate with local memory 112-1 through 112-J over local buses 118-1 through 118-J. For example, a thread group 104-J utilizes local memory 112-J by communicating over a local bus 118-J. Certain embodiments of computing system 100 allocate a shared portion of shared memory 110 to each thread block 102 and allow access to shared portions of shared memory 110 by all thread groups 104 within a thread block 102. Certain embodiments include thread groups 104 that use only local memory 112. Many other embodiments include thread groups 104 that balance use of local memory 112 and shared memory 110.
The embodiment of
Having described a representative computing system, various embodiments of the novel system and method for translating functions for correct handling of local-scope variables in thread-shared and private contexts will now be described in greater detail.
Program 210 may be apportioned into sequential regions 214 and parallel regions 216. Functions 212 within program 210 may be executed from either the sequential regions 214 or the parallel regions 216, and are often executed from both at various points in the sequence of program 210. Functions 212 employ variables that are of either global scope or local scope with respect to a particular function of functions 212. Of the local-scope variables within the particular function, there are thread-shared and thread-private variables. Thread-shared local-scope variables are accessible by all parallel threads within program kernels 218 executing on computing system 220. Thread-private local-scope variables are accessible only by the thread for which the variables are instantiated.
Local-scope variable identifier 202 includes a thread-shared local-scope variable identifier 204 and a thread-private local-scope variable identifier 206. Thread-shared local-scope variable identifier 204 operates on a single function of functions 212 by determining (1) whether the single function contains a parallel construct within which access to a local-scope variable can be had by different threads within the parallel construct, or (2) whether a local-scope variable within the single function escapes to other threads of execution (under escape analysis). If either of these conditions is true, the local-scope variable is a thread-shared local-scope variable. Otherwise, thread-private local-scope variable identifier 206 identifies the local-scope variable as a thread-private local-scope variable.
Function translator 208 operates on the single function of functions 212 by generating directives in program kernels 218 as to how memory should be allocated for each local-scope variable in the single function. For thread-shared local-scope variables, blocks of thread-shared memory 226 are allocated. For thread-private local-scope variables, blocks of thread-private memory 224 are allocated. When program kernels 218 are executed on processor cores 222 within computing system 220, i.e., at run-time, the process of allocating memory for variables begins according to the directives generated by function translator 208.
Realized herein are two methods of translating a function such that local-scope variables are selectively allocated thread-shared memory or thread-private memory. The methods realized herein do not require “inlining” all function calls or a whole-program analysis. The methods realized herein are general and support separate compilation and linking, which is one foundation of building large modular software.
The two methods disclosed herein take advantage of two characteristics of parallel programming models (e.g., OpenMP and OpenACC). The first is that not all local-scope variables need to be shared. The methods realized herein recognize this fact and identify the necessarily thread-shared local-scope variables. The second characteristic is that, when a function is called within a parallel region of a program, a thread cannot gain access to the local-scope variables of other threads. The methods realized herein recognize that local-scope variables need not be shared when instantiated within a function called within a parallel region of the program.
One specific embodiment of the first of the two methods disclosed herein includes the following procedures:
1. A block of thread-shared memory is pre-allocated for a shared stack. A block of thread-private memory is pre-allocated for a private stack associated with each thread.
2. Thread-shared local-scope variables and thread-private local-scope variables within a function are identified as follows:
3. A function foo( ) is cloned into two extra functions: a shared clone foo_shared( ) and a private clone foo_private( ).
a. In foo_shared( )
b. In foo_private( )
4. The original function foo( ) is translated into the following form:
where “in_parallel_flag” is a per-thread flag described in procedure 6 below. When addressing the function, the address of the translated form is used.
5. After procedures 3 and 4, only private clones are called directly within any parallel region. Only shared clones are called directly outside any parallel region. Indirect calls are made to the transformed version from procedure 4. Only thread-shared or potentially thread-shared local-scope variables are put in thread-shared memory.
6. A per-thread flag is used to mark the state of a thread. The value of the flag is either “true” or “false.” When a thread enters a parallel region, the flag is set to “true.” When the thread leaves a parallel region, the flag is set to “false.”
In a step 330, corresponding shared clones and private clones of at least some of the functions are created. In a step 340, local-scope variables employed in the functions are identified as being either thread-shared local-scope variables or thread-private local-scope variables. In a step 350, for at least one of the shared clones, thread-shared memory is allocated for storage of thread-shared local-scope variables and thread-private memory for storage of the thread-private local-scope variables. In a step 360, for at least one of the private clones, thread-private memory is allocated for storage of all local-scope variables. In a decisional step 370, it is determined whether or not a parallel region of the program is being executed. If so, in a step 380, the private clones are invoked in lieu of the corresponding functions during execution. If not, in a step 390, the shared clones are invoked in lieu of the corresponding functions.
In one embodiment, the invoking of the steps 380, 390 is effected by: (1) in the at least one of the shared clones, translating a call of a function: (1a) made in a parallel construct into a call of a private clone of the function and (1b) made in a sequential construct into a call of a shared clone of the function and (2) in the at least one of the private clones, translating a call of a function into a call of a private clone of the function. In a related embodiment, the invoking of the steps 380, 390 is effected by employing addresses of at least some of the functions in executing the program. In another related embodiment, the invoking of the steps 380, 390 is effected by translating the at least some of the functions to cause the invoking to be carried out. In a related embodiment, the allocating includes employing a flag having a first state during the execution of the sequential regions and a second state during the execution of the parallel regions. The method ends in an end step 395.
One specific embodiment of the second of the two methods disclosed herein includes the following procedures:
1. A block of thread-shared memory is pre-allocated for a shared stack. A block of thread-private memory is pre-allocated for a private stack associated with each thread.
2. Thread-shared local-scope variables and thread-private local-scope variables within a function are identified as follows:
3. In a function foo( ), all thread-private local-scope variables identified in procedure 2 are allocated in the thread-private memory. Each of the thread-shared local-scope variables identified in procedure 2 is allocated conditionally based on whether the function is called within a serial or a parallel context.
4. A per-thread flag is used to mark the state of a thread. The value of the flag is either “true” or “false.” When a thread enters a parallel region, the flag is set to “true.” When the thread leaves a parallel region, the flag is set to “false.”
In a step 430, at least some of the functions are identified as either being executed during sequential regions of the program or being executed during parallel regions of the program. In a step 440, local-scope variables employed in the at least some of the functions are identified as being either thread-shared local-scope variables or thread-private local-scope variables. In one embodiment, the identifying of the local-scope variables includes identifying a local-scope variable as a thread-shared local-scope variable if a function contains a parallel construct and multiple threads can gain access to the local-scope variable within the parallel construct according to a programming model, or if the local-scope variable escapes the function, and identifying the remaining local-scope variables as thread-private local-scope variables.
In a step 450, the thread-private memory is allocated for storage of the thread-private local-scope variables and thread-shared local-scope variables employed in functions executed during the parallel regions of the program. In a step 460, the thread-shared memory is allocated for storage of thread-shared local-scope variables employed in functions executed during the sequential regions of the program. In one embodiment, at least some of the functions are translated to cause the allocating to be carried out. In a related embodiment, the allocating includes employing a flag having a first state during the execution of the sequential regions and a second state during the execution of the parallel regions. The method ends in an end step 470.
Those skilled in the art to which this application relates will appreciate that other and further additions, deletions, substitutions and modifications may be made to the described embodiments.
This application claims the benefit of U.S. Provisional Application Ser. No. 61/722,661, filed by Lin, et al., on Nov. 5, 2012, entitled “Executing Sequential Code Using a Group of Threads,” commonly assigned with this application and incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
5875464 | Kirk | Feb 1999 | A |
6088770 | Tarui et al. | Jul 2000 | A |
6609193 | Douglas et al. | Aug 2003 | B1 |
7086063 | Ousterhout et al. | Aug 2006 | B1 |
7856541 | Kaneda et al. | Dec 2010 | B2 |
8250555 | Lee et al. | Aug 2012 | B1 |
8335892 | Minkin et al. | Dec 2012 | B1 |
8397013 | Rosenband et al. | Mar 2013 | B1 |
8516483 | Chinya et al. | Aug 2013 | B2 |
8547385 | Jiao | Oct 2013 | B2 |
8615646 | Nickolls et al. | Dec 2013 | B2 |
8683132 | Danilak | Mar 2014 | B1 |
20020029357 | Charnell et al. | Mar 2002 | A1 |
20030018684 | Ohaswa et al. | Jan 2003 | A1 |
20050125774 | Barsness | Jun 2005 | A1 |
20060095675 | Yang et al. | May 2006 | A1 |
20070136523 | Bonella et al. | Jun 2007 | A1 |
20070143582 | Coon et al. | Jun 2007 | A1 |
20070192541 | Balasubramonian et al. | Aug 2007 | A1 |
20070294512 | Crutchfield et al. | Dec 2007 | A1 |
20080052466 | Zulauf | Feb 2008 | A1 |
20080109795 | Buck | May 2008 | A1 |
20080126716 | Daniels | May 2008 | A1 |
20080276262 | Munshi | Nov 2008 | A1 |
20090006758 | Chung et al. | Jan 2009 | A1 |
20090013323 | May et al. | Jan 2009 | A1 |
20090031290 | Feng | Jan 2009 | A1 |
20090100244 | Chang et al. | Apr 2009 | A1 |
20100079454 | Legakis et al. | Apr 2010 | A1 |
20100281230 | Rabii et al. | Nov 2010 | A1 |
20110022672 | Chang | Jan 2011 | A1 |
20110072214 | Li et al. | Mar 2011 | A1 |
20110072438 | Fiyak et al. | Mar 2011 | A1 |
20110125974 | Anderson | May 2011 | A1 |
20110191522 | Condict et al. | Aug 2011 | A1 |
20110265068 | Elnozahy et al. | Oct 2011 | A1 |
20110320804 | Chan et al. | Dec 2011 | A1 |
20120072652 | Celis et al. | Mar 2012 | A1 |
20120089792 | Fahs et al. | Apr 2012 | A1 |
20120131309 | Johnson | May 2012 | A1 |
20120137055 | Lee et al. | May 2012 | A1 |
20120137099 | Shibayama et al. | May 2012 | A1 |
20120151179 | Gaertner et al. | Jun 2012 | A1 |
20120191953 | Eichenberger et al. | Jul 2012 | A1 |
20120204065 | Tsafrir et al. | Aug 2012 | A1 |
20120254530 | Tagaya | Oct 2012 | A1 |
20130263153 | Gschwind | Oct 2013 | A1 |
20140007114 | Wang | Jan 2014 | A1 |
20140129783 | Marathe et al. | May 2014 | A1 |
20140129812 | Chakrabarti et al. | May 2014 | A1 |
20140130052 | Lin et al. | May 2014 | A1 |
Number | Date | Country |
---|---|---|
1725176 | Jan 2006 | CN |
101176066 | May 2008 | CN |
101819675 | Sep 2010 | CN |
201140447 | Nov 2011 | TW |
201220246 | May 2012 | TW |
Entry |
---|
Reyes et al. (“accULL: An OpenACC Implementation with CUDA and OpenCL Support,” 2012). |
Chen, Dehao, et al., “Tree Partition based Parallel Frequent Pattern Mining on Shared Memory Systems”, IEEE, Apr. 25, 2006, 8 pages. |
Thomas, McCarthy, et al., “The Very Basics of Garbage Collection,” Aug. 26, 2008, 4 pages. |
Steffen, Michael, et al., “Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels”, 2010, 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 12 pages. |
Meng, Jiayuan, et al., “Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance”, Jun. 2010, 12 pages. |
Number | Date | Country
---|---|---|
20140130021 A1 | May 2014 | US |
Number | Date | Country
---|---|---|
61722661 | Nov 2012 | US |