API for launching work on a processor

Information

  • Patent Grant
  • 9268601
  • Patent Number
    9,268,601
  • Date Filed
    Thursday, March 31, 2011
  • Date Issued
    Tuesday, February 23, 2016
  • CPC
  • Field of Search
    • US
    • NON E00000
  • International Classifications
    • G06F9/48
    • Term Extension
      197
Abstract
One embodiment of the present invention sets forth a technique for launching work on a processor. The method includes the steps of initializing a first state object within a memory region accessible to a program executing on the processor, populating the first state object with data associated with a first workload that is generated by the program, and triggering the processing of the first workload on the processor according to the data within the first state object.
Description
BACKGROUND OF THE INVENTION

1. Field of the Invention


Embodiments of the present invention relate generally to processor architectures and, more specifically, to an application program interface (API) for launching work on a processor.


2. Description of the Related Art


In conventional computer systems, the processing power of a central processing unit (CPU) may be augmented by a co-processor, such as a GPU. GPUs are specialized processors that are configured to efficiently perform graphics processing operations or other operations that would otherwise be performed by the CPU. Some conventional computer systems are configured with a hybrid graphics system that includes, for example, an integrated GPU (iGPU) disposed on the motherboard along with the CPU and a discrete GPU (dGPU) located on an add-in card that is connected to the motherboard via a Peripheral Component Interconnect Express (PCI Express or PCIe) expansion bus and slot.


Typically, in such systems, work on the co-processor can only be launched by the CPU. Such a limitation can result in several inefficiencies. For example, if the co-processor is to execute a series of related tasks, where task B is dependent on the execution of task A, then the CPU will first launch task A on the GPU, wait until task A completes, and then launch task B. In such a scenario, because the CPU has to wait until the GPU indicates that task A has completed and then initiate the execution of task B, many clock cycles are wasted, thus reducing the overall performance of the system.


As the foregoing illustrates, what is needed in the art is an approach for launching work on a processor in a more efficient manner.


SUMMARY OF THE INVENTION

One embodiment of the present invention sets forth a method for launching work on a processor. The method includes the steps of initializing a first state object within a memory region accessible to a program executing on the processor, populating the first state object with data associated with a first workload that is generated by the program, and triggering the processing of the first workload on the processor according to the data within the first state object.


One advantage of the disclosed technique is that work can be launched on a processor from within the processor itself, thus eliminating wasted cycles in between the launching of two different tasks.





BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.



FIG. 1 illustrates a processing environment configured to implement one or more aspects of the present invention;



FIG. 2 is a timeline view when launching work within the processing environment of FIG. 1, according to one embodiment of the invention;



FIG. 3 is a flow diagram of method steps for launching a workload generated by an application program on a processor, according to one embodiment of the invention; and



FIG. 4 is a conceptual diagram of a computing device configured to implement one or more aspects of the present invention.





DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.



FIG. 1 illustrates a processing environment 100 configured to implement one or more aspects of the present invention. The processing environment 100 includes a processor 102, a program accessible memory 104, a processor driver 106, a work launching application program interface (API) 108 and an application program 110.


The processor 102 is coupled to the program accessible memory 104 and the processor driver 106. In operation, the processor 102 includes one or more processor cores that each execute a sequence of instructions associated with and/or transmitted by the various elements of the processing environment 100, such as the application program 110. The processor 102 can be a general purpose processor or a more special purpose processor, such as a graphics processing unit (GPU). The program accessible memory 104 is a memory space, usually a random access memory (RAM), that temporarily stores data needed to execute instructions within the processor 102. The data in the program accessible memory 104 can be set via software programs running within the processing environment 100 at any given time.


In operation, software programs, such as application program 110, interact with the processor 102 via the processor driver 106. More specifically, the processor driver 106 transmits commands generated by the application program 110 to the processor 102 for execution. In some cases, to initiate execution of a particular workload within the processor 102, the application program 110 interfaces with the processor 102 via the work launching API 108. The work launching API 108 interfaces with the processor driver 106 and allows the application program 110 to launch workloads for execution on the processor 102.


To launch a workload, the application program 110 interacts with different API commands of the work launching API 108 to (i) allocate memory space in the program accessible memory 104 for a state object, (ii) store state information needed to execute the workload within the state object and (iii) trigger the execution of the workload. In one embodiment, the same state object may be shared across multiple workloads triggered by the application program 110 via the work launching API 108. In another embodiment, where the processor 102 is a multi-threaded processor, different threads within the processor 102 may execute the same workload using different state objects stored within the program accessible memory 104. In yet another embodiment, a workload that is dependent on a primary workload which is currently being executed by the processor 102 can be automatically launched for execution within the processor 102 when the primary workload has been fully executed.
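The three-step sequence, together with the last embodiment (a dependent workload launched automatically when the primary workload finishes), can be pictured with a small host-side model. This is only an illustrative sketch in Python, not the API of the invention: the class `ToyProcessor`, the dictionary standing in for the program accessible memory 104, the `on_exit` field and the method names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical stand-in for the program accessible memory 104:
# a mapping from a handle to a state object (a plain dict of state information).
Memory = Dict[int, dict]


@dataclass
class ToyProcessor:
    """Toy model of processor 102 that runs workloads described by state objects."""
    memory: Memory = field(default_factory=dict)
    log: List[str] = field(default_factory=list)
    _next_handle: int = 0

    def create_state_object(self) -> int:
        # Step (i): allocate space for a state object and return a handle to it.
        handle = self._next_handle
        self._next_handle += 1
        self.memory[handle] = {"name": None, "threads": 0, "on_exit": None}
        return handle

    def populate_state_object(self, handle: int, **state) -> None:
        # Step (ii): store the state information needed to execute the workload.
        self.memory[handle].update(state)

    def trigger_workload(self, handle: int) -> None:
        # Step (iii): execute the workload according to its state object.
        state = self.memory[handle]
        self.log.append(f"run {state['name']} with {state['threads']} threads")
        # A dependent workload, if any, is launched from within the processor
        # as soon as the primary workload has been fully executed.
        dependent: Optional[int] = state["on_exit"]
        if dependent is not None:
            self.trigger_workload(dependent)


proc = ToyProcessor()
task_b = proc.create_state_object()
proc.populate_state_object(task_b, name="B", threads=64)
task_a = proc.create_state_object()
proc.populate_state_object(task_a, name="A", threads=32, on_exit=task_b)
proc.trigger_workload(task_a)  # a single trigger runs A, then B
```

In this model the application issues one trigger for task A, and task B starts without a further round trip to the caller, which is the inefficiency the background section describes.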



FIG. 2 is a timeline view 200 when launching work within the processing environment 100 of FIG. 1, according to one embodiment of the invention. As shown, there are three different steps for launching work on the processor 102: creating a state object 202, populating the state object 204 and triggering the workload execution 206.


The work launching API 108 provides functions that can be issued by the application program 110 for each of the above steps. For creating a state object at step 202, the work launching API 108 provides functions for initializing a specified portion of memory within the program accessible memory 104 that is to be allocated to a state object needed for executing a workload. The state objects 208, 210 and 212 illustrate state objects that have been initialized by the application program 110. The structure of the state object may be pre-defined or may be dynamic based on a specification provided by the application program 110. For populating the state object at step 204, the work launching API 108 provides functions for setting different pre-determined pieces of state information within the state object. State information can include the number of threads that will execute the workload, memory management information or, in the case of graphics processing, texture information. Examples of specific functions provided by the work launching API 108 for setting state information in the state object are listed in Table 1 below. For triggering the workload execution at step 206, the work launching API 108 provides functions for submitting the state object and launching the execution of the workload using the state object within the processor 102.
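One way to picture a state object whose structure is "dynamic based on a specification provided by the application program" is a byte layout derived from a list of fields. The sketch below is an assumption-laden illustration only: the flat `bytearray` standing in for the program accessible memory 104, the field names in `spec`, and the helper functions are hypothetical and do not come from the description.

```python
import struct

# Hypothetical program accessible memory: a flat byte region.
memory = bytearray(256)

# Hypothetical specification supplied by the application program:
# (field name, struct format code) pairs; "I" is an unsigned 32-bit integer.
spec = [("threads", "I"), ("shared_mem_bytes", "I"), ("texture_id", "I")]


def layout_from_spec(spec):
    """Derive a packed little-endian layout and a name -> (offset, code) map."""
    fmt = "<" + "".join(code for _, code in spec)
    fields, pos = {}, 0
    for name, code in spec:
        fields[name] = (pos, code)
        pos += struct.calcsize("<" + code)
    return fmt, fields


def init_state_object(memory, base, fmt):
    """Creation step: zero the portion of memory assigned to the state object."""
    size = struct.calcsize(fmt)
    memory[base:base + size] = bytes(size)
    return size


def set_field(memory, base, fields, name, value):
    """Population step: write one piece of state information into the object."""
    offset, code = fields[name]
    struct.pack_into("<" + code, memory, base + offset, value)


def read_state(memory, base, fmt, spec):
    """Decode the state object, as a trigger step would before launching."""
    values = struct.unpack_from(fmt, memory, base)
    return dict(zip((name for name, _ in spec), values))


fmt, fields = layout_from_spec(spec)
size = init_state_object(memory, base=16, fmt=fmt)
set_field(memory, 16, fields, "threads", 128)
set_field(memory, 16, fields, "shared_mem_bytes", 4096)
state = read_state(memory, 16, fmt, spec)
```

A pre-defined structure would correspond to fixing `spec` ahead of time; a dynamic one lets each application program pass its own field list.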



FIG. 3 is a flow diagram of method steps for launching a workload generated by an application program on a processor, according to one embodiment of the invention. Although the method steps are described in conjunction with the system of FIG. 1, persons skilled in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the invention.


The method 300 begins at step 302, where the work launching API 108 receives an instruction from the application program 110 to initialize a state object within the program accessible memory 104. The application program 110, via at least one function provided by the work launching API 108, specifies a size of memory to be allocated to the state object. In response, at step 304, the state object specified by the application program 110 is created within the program accessible memory 104.


At step 306, the work launching API 108 receives state information from the application program 110 for storing in the state object created at step 304. The application program 110, via at least one function provided by the work launching API 108, specifies the different pieces of state information that are to be set within the state object. In response, at step 308, the state object is populated with the state information specified by the application program 110.


At step 310, the work launching API 108 receives an indication from the application program 110 that a workload associated with the state object should be triggered within the processor 102. At step 314, the execution of the workload is triggered within the processor 102.



FIG. 4 is a conceptual diagram of an exemplary computing device 400 configured to implement one or more aspects of the present invention. The computing device 400 includes a central processing unit (CPU) 402, a system interface 404, a system memory 410, a GPU 450, a GPU local memory 460 and a display 470.


The CPU 402 connects to the system memory 410 and the system interface 404. The CPU 402 executes programming instructions stored in the system memory 410, operates on data stored in system memory 410 and communicates with the GPU 450 through the system interface 404, which bridges communication between the CPU 402 and GPU 450. In alternate embodiments, the CPU 402, GPU 450, system interface 404, or any combination thereof, may be integrated into a single processing unit. Further, the functionality of GPU 450 may be included in a chipset or in some other type of special purpose processing unit or co-processor. The system memory 410 stores programming instructions and data for processing by the CPU 402. The system memory 410 typically includes dynamic random access memory (DRAM) configured to either connect directly to the CPU 402 (as shown) or alternately, via the system interface 404. The GPU local memory 460 is any memory space accessible by the GPU 450 including local memory, system memory, on-chip memories, and peer memory. In some embodiments, the GPU 450 displays certain graphics images stored in the GPU local memory 460 on the display 470.


In one embodiment, the GPU 450 includes a number M of SPMs (not shown), where M≧1, each SPM configured to process one or more thread groups. The series of instructions transmitted to a particular GPU 450 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SPM is referred to herein as a “warp” or “thread group.” As used herein, a “thread group” refers to a group of threads concurrently executing the same program on different input data, with each thread of the group being assigned to a different processing engine within an SPM. A thread group may include fewer threads than the number of processing engines within the SPM, in which case some processing engines will be idle during cycles when that thread group is being processed. A thread group may also include more threads than the number of processing engines within the SPM, in which case processing will take place over consecutive clock cycles. Since each SPM can support up to G thread groups concurrently, it follows that up to G*M thread groups can be executing in GPU 450 at any given time.
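The sizing relationships in this paragraph reduce to simple arithmetic, sketched below. The numbers (8 processing engines, G = 24, M = 16) are hypothetical values chosen only for illustration, and the sketch assumes one pass over the processing engines per clock cycle.

```python
import math


def passes_needed(threads_in_group: int, engines: int) -> int:
    """Consecutive passes needed to issue one instruction for the whole group."""
    return math.ceil(threads_in_group / engines)


def idle_engine_slots(threads_in_group: int, engines: int) -> int:
    """Engine slots left unused while the group is processed."""
    return passes_needed(threads_in_group, engines) * engines - threads_in_group


def max_concurrent_thread_groups(g: int, m: int) -> int:
    """Upper bound G*M on thread groups executing in the GPU at one time."""
    return g * m


ENGINES = 8  # hypothetical number of processing engines per SPM

# Fewer threads than engines: one pass, two engines idle.
small = (passes_needed(6, ENGINES), idle_engine_slots(6, ENGINES))

# More threads than engines: four consecutive passes, no idle engines.
large = (passes_needed(32, ENGINES), idle_engine_slots(32, ENGINES))

# Hypothetical G = 24 thread groups per SPM and M = 16 SPMs.
limit = max_concurrent_thread_groups(g=24, m=16)
```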


Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within an SPM. This collection of thread groups is referred to herein as a “cooperative thread array” (“CTA”) or “thread array.” The size of a particular CTA is equal to m*k, where k is the number of concurrently executing threads in a thread group and is typically an integer multiple of the number of parallel processing engines within the SPM, and m is the number of thread groups simultaneously active within the SPM. The size of a CTA is generally determined by the programmer and the amount of hardware resources, such as memory or registers, available to the CTA.
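As a worked illustration of the m*k relationship, the sketch below computes a CTA size and bounds m by a single hardware resource. The register budget and per-thread register count are hypothetical figures, and limiting m by registers alone is a simplifying assumption; the description names memory as another limiting resource.

```python
def cta_size(m: int, k: int) -> int:
    """CTA size in threads: m thread groups of k threads each."""
    return m * k


def max_thread_groups(k: int, regs_per_thread: int, regs_available: int) -> int:
    """Largest m whose m*k threads fit in the available registers."""
    return regs_available // (k * regs_per_thread)


ENGINES = 8          # hypothetical processing engines per SPM
K = 4 * ENGINES      # k = 32, an integer multiple of the engine count

m_limit = max_thread_groups(K, regs_per_thread=16, regs_available=8192)
largest_cta = cta_size(m_limit, K)
chosen_cta = cta_size(4, K)  # a programmer may pick any m up to m_limit
```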


The system memory 410 includes an application program 412, application data 414, the work launching API 108, a GPU driver 418 and GPU driver data 420. The application program 412 generates calls to the work launching API 108 as previously described in order to create state objects within the GPU local memory 460 and trigger the execution of workloads on the GPU 450 using those state objects.


Table 1 includes a list of functions provided by the work launching API 108 for creating and populating state objects as well as triggering the execution of workloads on the processor 102.











TABLE 1

Create State Object Function

| FUNCTION NAME | INPUTS | DESCRIPTION |
| --- | --- | --- |
| launcherInitialize | launcher: Launcher memory to initialize. func: Device-side function for the launcher, or NULL. | Initializes a state object. |

Populate State Object Functions

| FUNCTION NAME | INPUTS | DESCRIPTION |
| --- | --- | --- |
| launcherSetCtaWidth | launcher: Handle to initialized launcher. ctaWidth: Width of the CTA. | Set the width of each CTA in threads, must be >0, default is zero. |
| launcherSetCtaHeight | launcher: Handle to initialized launcher. ctaHeight: Height of the CTA. | Set the height of each CTA in threads, must be >0, default is zero. |
| launcherSetCtaDepth | launcher: Handle to initialized launcher. ctaDepth: Depth of the CTA. | Set the depth of each CTA in threads, must be >0, default is zero. |
| launcherSetGridWidth | launcher: Handle to initialized launcher. gridWidth: Width of the grid. | Set the width of the grid in CTAs, default is zero. |
| launcherSetGridHeight | launcher: Handle to initialized launcher. gridHeight: Height of the grid. | Set the height of the grid in CTAs, default is zero. |
| launcherSetSharedMemorySize | launcher: Handle to initialized launcher. memSize: Size of shared memory. | Sets the size in bytes of the dynamic shared memory used by the launched CTAs. |
| launcherSetRegisterCount | launcher: Handle to initialized launcher. regCount: Count of registers. | Overrides the compiler-generated register count for the launched CTAs. |
| launcherSetL1Configuration | launcher: Handle to initialized launcher. l1Config: Particular L1 configuration. | Sets the L1 cache-shared memory configuration required by the launched CTAs. |
| launcherSetInvalidateTextureCache | launcher: Handle to initialized launcher. bool: invalidate. | If true, invalidate the texture cache (in the GPU memory) prior to launching work. False by default. |
| launcherSetInvalidateShaderCache | launcher: Handle to initialized launcher. bool: invalidate. | If true, invalidate the shader cache (in the GPU memory) prior to launching work. False by default. |
| launcherSetInvalidateConstantCache | launcher: Handle to initialized launcher. bool: invalidate. | If true, invalidate the constant cache (in the GPU memory) prior to launching work. False by default. |
| launcherSetParameterBuffer | launcher: Handle to initialized launcher. dParameterBuffer: Pointer to parameter buffer. | Sets the pointer to a parameter buffer containing the data for the parameters in the kernel signature. |
| launcherSetExtraParameterBuffer | launcher: Handle to initialized launcher. dExtraParameterBuffer: Pointer to extra parameter buffer. | Sets the pointer to an additional memory buffer that the user can read from in the launched task. |
| launcherSetAtCtaExitCallback | launcher: Handle to initialized launcher. cbLauncher: Handle to initialized callback launcher. cbParams: Pointer to callback parameters. | Support to launch grids of work directly at CTA exit without explicitly going through the command buffer. |
| launcherSetAtGridExitCallback | launcher: Handle to initialized launcher. cbLauncher: Handle to initialized callback launcher. cbParams: Pointer to callback parameters. | Support to launch grids of work directly at grid exit without explicitly going through the command buffer. |
| launcherSetQueueBuffer | launcher: Handle to initialized launcher. dQueueBuffer: Pointer to queue buffer. | Specify queue storage for queue-based launchers. Each element in the queue contains the varying arguments to a CTA. |
| launcherSetQueueElementCount | launcher: Handle to initialized launcher. queueElementCount: Number of CTA elements in the queue storage array. | Specify the number of elements in the queue associated with the launcher. |
| launcherSetQueueElementSize | launcher: Handle to initialized launcher. queueElementSize: Size of each CTA element in the queue storage array. | Specify the size of each element in the queue associated with the launcher. |
| launcherSetLogicalSmDisabledMask | launcher: Handle to initialized launcher. smMask: A mask that determines the set of logical SM indices to which CTAs can be launched. | Sets a mask that determines the set of logical SM indices to which CTAs can be launched. |
| launcherSetPriority | launcher: Handle to initialized launcher. priority: Priority of the launcher having a value between 0 and a pre-determined value. | Sets the priority level of this launcher. |
| launcherSetAddToHeadOfPriorityLevel | launcher: Handle to initialized launcher. b: Boolean indicating whether the priority of the launcher should be considered. | If true, the scheduler will add the launcher to the head of the ‘priority level’ set with launcherSetPriority, otherwise the launcher is added to the tail. |

Trigger Execution Functions

| FUNCTION NAME | INPUTS | DESCRIPTION |
| --- | --- | --- |
| launcherFinalize | launcher: Handle to initialized launcher. | Notify GPU that the state object is configured and ready for work. |
| launcherReset | launcher: Handle to initialized launcher. | Reset a state object to allow its reuse. |
| launcherSubmitGrid | launcher: Handle to initialized launcher. | Launch a grid of work with grid width * height * depth CTAs for the specified launcher. |
| launcherSubmitGridCommands | launcher: Handle to initialized launcher. dstCmdBufSeg: Destination command buffer segment. | Writes into the given buffer the GPU commands required to launch a grid of work for the previously configured state object. |
| launcherSubmitQueueElements | launcher: Handle to initialized launcher. elementStart: Element index of first CTA to launch. elementCount: Number of element CTAs to launch. | Launch CTAs for a queue-based launcher using elements stored in the associated dQueueBuffer storage. |
| launcherInvalidateInstructionCache | launcher: Handle to initialized launcher. b: Boolean indicating whether instruction cache should be invalidated before the work is launched. | If true, invalidate the instruction cache prior to launching work. |
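To make the roles of a few Table 1 functions concrete, the following Python sketch models a launcher as a plain data object. It is a hypothetical host-side model, not the API itself: the snake_case names only mirror launcherFinalize, launcherSubmitGrid, launcherSetPriority and launcherSetAddToHeadOfPriorityLevel, and rejecting unconfigured CTA dimensions at finalize time is an assumed policy. The grid is two-dimensional here because Table 1 lists setters only for grid width and height.

```python
from collections import deque
from dataclasses import dataclass
from itertools import product


@dataclass
class Launcher:
    """Hypothetical state object; every default is zero/False as in Table 1."""
    cta_width: int = 0
    cta_height: int = 0
    cta_depth: int = 0
    grid_width: int = 0
    grid_height: int = 0
    priority: int = 0
    add_to_head: bool = False
    finalized: bool = False


def launcher_finalize(launcher: Launcher) -> None:
    # Table 1 requires each CTA dimension to be > 0 although the default is
    # zero, so an unconfigured launcher is rejected here (an assumed policy).
    dims = (launcher.cta_width, launcher.cta_height, launcher.cta_depth)
    if min(dims) <= 0:
        raise ValueError("CTA width, height and depth must all be > 0")
    launcher.finalized = True


def launcher_submit_grid(launcher: Launcher):
    """Return the (x, y) index of every CTA in the grid, one entry per CTA."""
    if not launcher.finalized:
        raise RuntimeError("launcher must be finalized before submission")
    return list(product(range(launcher.grid_width), range(launcher.grid_height)))


def schedule(levels: dict, name: str, launcher: Launcher) -> None:
    """Queue a launcher at its priority level, at the head or at the tail."""
    queue = levels.setdefault(launcher.priority, deque())
    if launcher.add_to_head:
        queue.appendleft(name)
    else:
        queue.append(name)


a = Launcher(cta_width=32, cta_height=1, cta_depth=1, grid_width=4, grid_height=2)
launcher_finalize(a)
ctas = launcher_submit_grid(a)  # 4 * 2 = 8 CTAs
threads_per_cta = a.cta_width * a.cta_height * a.cta_depth

levels = {}
schedule(levels, "first", Launcher(priority=1))
schedule(levels, "second", Launcher(priority=1))
schedule(levels, "urgent", Launcher(priority=1, add_to_head=True))
```

With these inputs the launcher named "urgent" is placed ahead of the two launchers queued earlier at the same priority level, matching the head-versus-tail behavior described for launcherSetAddToHeadOfPriorityLevel.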









While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.


In view of the foregoing, the scope of the present invention is determined by the claims that follow.

Claims
  • 1. A computer-implemented method for launching work on a processor, the method comprising: initializing, via an application programming interface (API), a first state object that is assigned to a first workload and resides within a memory region accessible to a program executing on the processor; populating the first state object with data that indicates a first number of cooperative thread arrays (CTAs) that are responsible for processing the first workload that is generated by the program, wherein the first number is greater than or equal to two; generating the first number of CTAs in order to process the first workload on the processor according to the data within the first state object; launching, via the API, the first number of CTAs for a queue-based launcher using data elements stored in a queue buffer; and invalidating, via the API, an instruction cache associated with the first data object prior to launching the first number of CTAs; wherein a structure of the first state object is dynamic.
  • 2. The computer-implemented method of claim 1, wherein the step of initializing the first state object is performed when a first instruction is received from the program.
  • 3. The computer-implemented method of claim 2, wherein the first instruction specifies a portion of the memory region to be allocated to the first state object.
  • 4. The computer-implemented method of claim 3, further comprising deallocating the portion of the memory region allocated to the first state object once the first workload has been processed.
  • 5. The computer-implemented method of claim 1, wherein the data associated with the first workload comprises state information necessary to process the first workload.
  • 6. The computer-implemented method of claim 5, wherein the state information is accessed when a second workload generated by the program is processed.
  • 7. The computer-implemented method of claim 1, wherein the processor comprises a plurality of processing cores and processes the first workload with a first processing core.
  • 8. The computer-implemented method of claim 7, further comprising: initializing a second state object within the memory region; and populating the second state object with additional data associated with the first workload, wherein the first workload is processed according to the additional data within the second state object by a second processing core of the processor.
  • 9. The computer-implemented method of claim 1, wherein a second workload that is generated by the program is dependent on a result generated from the processing of the first workload, and further comprising automatically triggering the processing of the second workload on the processor according to the result when the first workload has been processed.
  • 10. The computer-implemented method of claim 1, wherein the result of the execution of the first workload is stored in the memory region.
  • 11. The computer-implemented method of claim 1, wherein the data included in the first state object further indicates a second number of threads in each cooperative thread array, and wherein generating the first number of CTAs further comprises generating the second number of threads for each CTA in order to process the first workload.
  • 12. The computer-implemented method of claim 1, wherein the API includes a first function configured to initialize the first state object.
  • 13. The computer-implemented method of claim 12, wherein the API further includes a second function configured to populate the first state object with the data and a third function configured to launch the first workload.
  • 14. The computer-implemented method of claim 13, wherein the API further includes a fourth function configured to set parameters associated with the first number of CTAs.
  • 15. The computer-implemented method of claim 14, wherein: initializing the first state object comprises invoking the first function; populating the first state object comprises invoking the second function; and generating the first number of CTAs comprises invoking the fourth function.
  • 16. The computer-implemented method of claim 1, further comprising resetting, via the API, the first state object such that the first state object can be reused.
  • 17. The computer-implemented method of claim 1, wherein the structure of the state object is specified by the program.
  • 18. A non-transitory computer readable medium storing instructions that, when executed by a processor, cause the processor to launch work, by performing the steps of: initializing, via an application programming interface (API), a first state object that is assigned to a first workload and resides within a memory region accessible to a program executing on the processor; populating the first state object with data that indicates a first number of cooperative thread arrays (CTAs) that are responsible for processing the first workload that is generated by the program; generating the first number of CTAs in order to process the first workload on the processor according to the data within the first state object; and invalidating, via the API, a cache associated with the first data object prior to launching the first number of CTAs.
  • 19. The non-transitory computer readable medium of claim 18, wherein the step of initializing the first state object is performed when a first instruction is received from the program.
  • 20. The non-transitory computer readable medium of claim 19, wherein the first instruction specifies a portion of the memory region to be allocated to the first state object.
  • 21. The non-transitory computer readable medium of claim 20, further comprising deallocating the portion of the memory region allocated to the first state object once the first workload has been processed.
  • 22. The non-transitory computer readable medium of claim 18, wherein the data associated with the first workload comprises state information necessary to process the first workload.
  • 23. The non-transitory computer readable medium of claim 22, wherein the state information is accessed when a second workload generated by the program is processed.
  • 24. The non-transitory computer readable medium of claim 18, wherein the processor comprises a plurality of processing cores and processes the first workload with a first processing core.
  • 25. The non-transitory computer readable medium of claim 24, further comprising: initializing a second state object within the memory region; and populating the second state object with additional data associated with the first workload, wherein the first workload is processed according to the additional data within the second state object by a second processing core of the processor.
  • 26. The non-transitory computer readable medium of claim 18, wherein a second workload that is generated by the program is dependent on a result generated from the processing of the first workload, and further comprising automatically triggering the processing of the second workload on the processor according to the result when the first workload has been processed.
  • 27. A computer system, comprising: a memory; and a processor that: initializes, via an application programming interface (API), a first state object that is assigned to a first workload and resides within a memory region accessible to a program executing on the processor, populates the first state object with data that indicates a first number of cooperative thread arrays (CTAs) that are responsible for processing the first workload that is generated by the program, generates the first number of CTAs in order to process the first workload on the processor according to the data within the first state object, and invalidates, via the API, a cache associated with the first data object prior to launching the first number of CTAs.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims benefit of United States provisional patent application entitled “CULauncher API for Computer GWC” filed on Apr. 5, 2010 and having a Ser. No. 61/321,096.

Related Publications (1)
Number Date Country
20110247018 A1 Oct 2011 US
Provisional Applications (1)
Number Date Country
61321096 Apr 2010 US