In computing devices, graphics processing units (GPUs) often supplement the central processing unit (CPU) by providing electronic circuitry that can perform mathematical operations rapidly. To do this, GPUs use extensive parallelism and many concurrent threads to hide the latency of memory accesses and computation. These capabilities make GPUs useful for accelerating high-performance graphics processing and parallel computing tasks. For instance, a GPU can accelerate the processing of two-dimensional (2D) or three-dimensional (3D) images in a surface for media or 3D applications.
Computer programs can be written specifically for the GPU. Examples of GPU applications include video encoding/decoding, three-dimensional games, and other general-purpose computing applications. The programming interface to a GPU is made up of two parts. The first is a high-level programming language, which allows the developer to write programs to run on the GPU, together with the corresponding compiler software, which compiles the GPU programs and generates the GPU-specific instructions (e.g., binary code). A set of GPU-specific instructions that makes up a program executed by the GPU may be referred to as a programmable workload or “kernel.” The second part is the host runtime library, which runs on the CPU side and provides a set of application programming interfaces (APIs) that allow the user to launch GPU programs on the GPU for execution. The two components work together as a GPU programming framework. Examples of such frameworks include the Open Computing Language (OpenCL), DirectX by Microsoft, and CUDA by NVIDIA. Depending on the application, multiple GPU workloads may be required to complete a single GPU task, such as image processing. The CPU runtime submits the workloads to the GPU one by one, assembling a GPU command buffer for each and passing it to the GPU by a direct memory access (DMA) mechanism. The GPU command buffer may be referred to as a “DMA packet” or “DMA buffer.” Each time the GPU completes its processing of a DMA packet, the GPU issues an interrupt to the CPU. The CPU handles the interrupt with an interrupt service routine (ISR) and schedules a corresponding deferred procedure call (DPC). Existing runtimes, including OpenCL, submit each workload to the GPU as a separate DMA packet. Thus, with existing techniques, at least one ISR and one DPC are associated with every workload.
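To make the per-workload submission pattern concrete, the following is a minimal host-side sketch using the standard OpenCL APIs (the kernel source, buffer size, and three-iteration loop are illustrative assumptions, and error handling is omitted). Each enqueued kernel corresponds to one workload and, under existing runtimes, one DMA packet:

    #include <CL/cl.h>

    /* Illustrative kernel source: one trivial workload. */
    static const char *src =
        "__kernel void workload(__global float *buf) {"
        "    size_t i = get_global_id(0);"
        "    buf[i] = buf[i] * 2.0f;"
        "}";

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "workload", NULL);

        float data[1024] = {0};
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        clSetKernelArg(kernel, 0, sizeof(buf), &buf);

        size_t global = 1024;
        for (int i = 0; i < 3; ++i) {
            /* Each enqueued workload becomes a separate DMA packet, so each
             * completion raises its own interrupt, ISR, and DPC on the CPU. */
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                   0, NULL, NULL);
            clFinish(queue); /* wait for this workload's DMA packet to complete */
        }

        clReleaseMemObject(buf);
        clReleaseKernel(kernel);
        clReleaseProgram(prog);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 0;
    }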
The concepts described herein are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. Where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.
While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. Additionally, it should be appreciated that items included in a list in the form of “at least one A, B, and C” can mean (A); (B); (C); (A and B); (B and C); (A and C); or (A, B, and C). Similarly, items listed in the form of “at least one of A, B, or C” can mean (A); (B); (C); (A and B); (B and C); (A and C); or (A, B, and C).
The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory machine-readable (e.g., computer-readable) storage medium, which may be read and executed by one or more processors. A machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).
In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.
Referring now to
The computing device 100 may be embodied as any type of device for performing the functions described herein. For example, the computing device 100 may be embodied as, without limitation, a smart phone, a tablet computer, a wearable computing device, a laptop computer, a notebook computer, a mobile computing device, a cellular telephone, a handset, a messaging device, a vehicle telematics device, a server computer, a workstation, a distributed computing system, a multiprocessor system, a consumer electronic device, and/or any other computing device configured to perform the functions described herein. As shown in
The CPU 120 may be embodied as any type of processor capable of performing the functions described herein. For example, the CPU 120 may be embodied as a single or multi-core processor(s), digital signal processor, microcontroller, or other processor or processing/controlling circuit. The GPU 160 is embodied as any type of graphics processing unit capable of performing the functions described herein. For example, the GPU 160 may be embodied as a single or multi-core processor(s), digital signal processor, microcontroller, floating-point accelerator, co-processor, or other processor or processing/controlling circuit designed to rapidly manipulate and alter data in memory. The GPU 160 includes a number of execution units 162. The execution units 162 may be embodied as an array of processor cores or parallel processors, which can execute a number of parallel threads. In various embodiments of the computing device 100, the GPU 160 may be embodied as a peripheral device (e.g., on a discrete graphics card), or may be located on the CPU motherboard or on the CPU die.
The CPU memory 126 and the GPU memory 164 may each be embodied as any type of volatile or non-volatile memory or data storage capable of performing the functions described herein. In operation, the memory 126, 164 may store various data and software used during operation of the computing device 100 such as operating systems, applications, programs, libraries, and drivers. For example, portions of the CPU memory 126 at least temporarily store command buffers and DMA packets that are created by the CPU 120 as disclosed herein, and portions of the GPU memory 164 at least temporarily store the DMA packets, which are transferred by the CPU 120 to the GPU memory 164 by the direct memory access subsystem 124.
The CPU memory 126 is communicatively coupled to the CPU 120, e.g., via the I/O subsystem 122, and the GPU memory 164 is similarly communicatively coupled to the GPU 160. The I/O subsystem 122 may be embodied as circuitry and/or components to facilitate input/output operations with the CPU 120, the CPU memory 126, the GPU 160 (and/or the execution units 162), the GPU memory 164, and other components of the computing device 100. For example, the I/O subsystem 122 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, firmware devices, communication links (i.e., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.) and/or other components and subsystems to facilitate the input/output operations. In some embodiments, the I/O subsystem 122 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with the CPU 120, the CPU memory 126, the GPU 160, the GPU memory 164, and/or other components of the computing device 100, on a single integrated circuit chip.
The illustrative I/O subsystem 122 includes a direct memory access (DMA) subsystem 124, which facilitates data transfer between the CPU memory 126 and the GPU memory 164. In some embodiments, the I/O subsystem 122 (e.g., the DMA subsystem 124) allows the GPU 160 to directly access the CPU memory 126 and allows the CPU 120 to directly access the GPU memory 164. The DMA subsystem 124 may be embodied as a DMA controller or DMA “engine,” such as a Peripheral Component Interconnect (PCI) device, a Peripheral Component Interconnect-Express (PCI-Express) device, an I/O Acceleration Technology (I/OAT) device, and/or others.
The data storage device 128 may be embodied as any type of device or devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, or other data storage devices. The data storage device 128 may include a system partition that stores data and firmware code for the computing device 100. The data storage device 128 may also include an operating system partition that stores data files and executables for an operating system 140 of the computing device 100.
The display 130 may be embodied as any type of display capable of displaying digital information such as a liquid crystal display (LCD), a light emitting diode (LED), a plasma display, a cathode ray tube (CRT), or other type of display device. In some embodiments, the display 130 may be coupled to a touch screen or other user input device to allow user interaction with the computing device 100. The display 130 may be part of a user interface subsystem 136. The user interface subsystem 136 may include a number of additional devices to facilitate user interaction with the computing device 100, including physical or virtual control buttons or keys, a microphone, a speaker, a unidirectional or bidirectional still and/or video camera, and/or others. The user interface subsystem 136 may also include devices, such as motion sensors, proximity sensors, and eye tracking devices, which may be configured to detect, capture, and process various other forms of human interactions involving the computing device 100.
The computing device 100 further includes communication circuitry 134, which may be embodied as any communication circuit, device, or collection thereof, capable of enabling communications between the computing device 100 and other electronic devices. The communication circuitry 134 may be configured to use any one or more communication technology (e.g., wireless or wired communications) and associated protocols (e.g., Ethernet, Bluetooth®, Wi-Fi®, WiMAX, 3G/LTE, etc.) to effect such communication. The communication circuitry 134 may be embodied as a network adapter, including a wireless network adapter.
The illustrative computing device 100 also includes a number of computer program components, such as a device driver 132, an operating system 140, a user space driver 142, and a graphics subsystem 144. Among other things, the operating system 140 facilitates the communication between user space applications, such as GPU applications 210 (
In the illustrative embodiment, the user space driver 142 and the device driver 132 cooperate as a “driver pair,” and handle communications between user space applications, such as GPU applications 210 (
The graphics subsystem 144 facilitates communications between the user space driver 142, the device driver 132, and one or more user space applications, such as the GPU applications 210. The graphics subsystem 144 may be embodied as any type of computer program subsystem capable of performing the functions described herein, such as an application programming interface (API) or suite of APIs, a combination of APIs and runtime libraries, and/or other computer program components. Examples of graphics subsystems include the Media Development Framework (MDF) runtime library by Intel Corporation, the OpenCL runtime library, and the DirectX Graphics Kernel Subsystem and Windows Display Driver Model by Microsoft Corporation.
The illustrative graphics subsystem 144 includes a number of computer program components, such as a GPU scheduler 146, an interrupt handler 148, and a batch submission mechanism 150. The GPU scheduler 146 communicates with the device driver 132 to control the submission of DMA packets in a working queue 212 (
Referring now to
The batch submission mechanism 150 includes program code that enables the creation of the command buffer as disclosed herein. An example of a method 400 that may be implemented by the program code of the batch submission mechanism 150 to create the command buffer is shown in
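The original listing for Code Example 1 is not reproduced in this text; the following reconstructed sketch, inferred from the accompanying description, shows the general form of a command buffer that separately dispatches multiple workloads without synchronization (the command names and workload count are assumptions):

Code Example 1 (reconstructed sketch):

    Setup Commands (cache configuration, surface state, media state, ...)
    Media Object Walker (Workload 1)
    Media Object Walker (Workload 2)
    ...
    Media Object Walker (Workload N)
    Pipe Control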
In Code Example 1, the setup commands may include GPU commands to prepare the information that the GPU 160 needs to execute the workloads on the execution units 162. Such commands may include, for example, cache configuration commands, surface state setup commands, media state setup commands, pipe control commands, and/or others. Each media object walker command causes the GPU 160 to dispatch multiple threads running on the execution units 162 for the workload identified as a parameter in the command. The pipe control command ensures that all of the preceding commands finish executing before the GPU finishes execution of the command buffer. Thus, the GPU 160 generates only one interrupt, at the completion of the processing of all of the individually-dispatched workloads contained in the command buffer, and the CPU 120 handles it with a single interrupt service routine (ISR) and a single deferred procedure call (DPC). In this way, multiple workloads contained in one command buffer generate only one ISR and one DPC.
For comparison purposes, an example of pseudo code for a command buffer that may be created by existing techniques (such as current versions of OpenCL) for multiple workloads, without synchronization, is shown in Code Example 2 below.
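The original listing for Code Example 2 is likewise not reproduced; a reconstructed sketch, based on the discussion that follows, is:

Code Example 2 (reconstructed sketch):

    Setup Commands
    Media Object Walker (Merged Workload)
    Pipe Control

Here the merged workload is a single workload into which the developer has manually combined what would otherwise be separate workloads.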
In Code Example 2, the setup commands may be similar to those described above. However, the multiple workloads have been combined manually by a developer (e.g., a GPU programmer) into a single workload, which is then dispatched to the GPU 160 by a single media object walker command. Although a single DMA packet is created from Code Example 2, resulting in one ISR and one DPC, the merged workload is much larger than the separate workloads taken individually. Such a large workload can strain the hardware resources of the GPU 160 (e.g., the GPU instruction cache and/or registers). As noted above, a known alternative to the manual merging of workloads is to create a separate DMA packet for each workload; however, separate DMA packets result in many more ISRs and DPCs than a single DMA packet containing multiple workloads as disclosed herein.
In the workload synchronization working mode, the batch submission mechanism 150 creates the command buffer to separately dispatch each of the workloads to the GPU 160 in the same command buffer, and the synchronization mechanism 152 inserts a synchronization command between the workload dispatch commands to ensure that the workload dependency conditions are met. To do this, the batch submission mechanism 150 inserts one dispatch command into the command buffer for each workload and the synchronization mechanism 152 inserts the appropriate pipe control command after each dispatch command, as needed. An example of pseudo code for a command buffer that may be created by the batch submission mechanism 150 (including the synchronization mechanism 152) for multiple workloads, with synchronization, is shown in Code Example 3 below.
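The original listing for Code Example 3 is not reproduced; a reconstructed sketch for three dependent workloads, based on the discussion that follows, is:

Code Example 3 (reconstructed sketch):

    Setup Commands
    Media Object Walker (Workload 1)
    Pipe Control (Sync 2,1)
    Media Object Walker (Workload 2)
    Pipe Control (Sync 3,2)
    Media Object Walker (Workload 3)
    Pipe Control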
In Code Example 3, the setup commands and media object walker commands are similar to those described above with reference to Code Example 1. Each pipe control (sync) command includes parameters that identify the workloads that have a dependency condition. For example, the pipe control (sync 2,1) command ensures that the media object walker (Workload 1) command finishes executing before the GPU 160 begins execution of the media object walker (Workload 2) command. Similarly, the pipe control (sync 3,2) command ensures that the media object walker (Workload 2) command finishes executing before the GPU 160 begins execution of the media object walker (Workload 3) command.
Referring now to
At block 316, the computing device 100 (e.g., the CPU 120) prepares the DMA packet from the command buffer, including the batched workloads. To do this, the illustrative device driver 132 validates the command buffer and writes the DMA packet in the device-specific format. In embodiments in which the command buffer is embodied as human-readable program code, the computing device 100 converts the human-readable commands in the command buffer to machine-readable instructions that can be executed by the GPU 160. Thus, the DMA packet contains machine-readable instructions, which may correspond to human-readable commands contained in the command buffer. At block 318, the computing device 100 (e.g., the CPU 120) submits the DMA packet to the GPU 160 for execution. To do this, the computing device (e.g., the CPU 120, by the GPU scheduler 146 in coordination with the device driver 132) assigns memory addresses to the resources in the DMA packet, assigns a unique identifier to the DMA packet (e.g., a buffer fence ID), and queues the DMA packet to the GPU 160 (e.g., to an execution unit 162).
At block 320, the computing device 100 (e.g., the GPU 160) processes the DMA packet with the batched workloads. For example, the GPU 160 may process each workload on a different execution unit 162 using multiple threads. When the GPU 160 finishes processing the DMA packet (subject to any synchronization commands that may be included in the DMA packet), the GPU 160 generates an interrupt, at block 322. The interrupt is received by the CPU 120 (e.g., by the interrupt handler 148). At block 324, the computing device 100 (e.g., the CPU 120) determines whether the processing of the DMA packet by the GPU 160 is complete. To do this, the device driver 132 evaluates the interrupt information, including the identifier (e.g., buffer fence ID) of the DMA packet just completed. If the device driver 132 concludes that the processing of the DMA packet by the GPU 160 has finished, the device driver 132 notifies the graphics subsystem 144 (e.g., the GPU scheduler 146) that the DMA packet processing is complete, and queues a deferred procedure call (DPC). At block 326, the computing device 100 (e.g., the CPU 120) notifies the GPU scheduler 146 that the DPC has completed. To do this, the DPC may call a callback function provided by the GPU scheduler 146. In response to the notification that the DPC is complete, the computing device 100 (e.g., the CPU 120, by the GPU scheduler 146) schedules the next GPU task in the working queue 212 for processing by the GPU 160.
Referring now to
At block 420, the computing device 100 determines whether workload synchronization is required. To do this, the computing device 100 determines whether the output of the first workload is used as input to any other workload (e.g., by examining parameters or arguments of the create workload commands). If synchronization is needed, the computing device 100 inserts the synchronization command in the command buffer after the create workload command. For example, with the Media Development Framework runtime APIs, a pCmTask->AddSync( ) API may be used. At block 424, the computing device 100 determines whether there is another workload to be added to the command buffer. If so, the computing device 100 returns to block 418 and adds the workload to the command buffer. If there are no more workloads to be added to the command buffer, the computing device 100 creates the DMA packet and submits the DMA packet to the working queue 212. At block 426, the GPU scheduler 146 submits the DMA packet to the GPU 160 if the GPU 160 is currently available to process it. At block 428, the computing device 100 (e.g., the CPU 120) waits for a notification from the GPU 160 that the GPU 160 has completed executing the DMA packet, and the method 400 ends. Following block 428, the computing device 100 may initiate the creation of another command buffer as described above.
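As a concrete illustration of the method 400, the following sketch builds a batched, synchronized task using host APIs in the style of the Media Development Framework runtime (CreateTask, AddKernel, AddSync, and Enqueue); the SubmitBatched helper and the producer/consumer kernel roles are hypothetical, and error handling is omitted:

    #include "cm_rt.h"

    // A sketch, not the literal method 400: build one command buffer
    // containing two dependent workloads and submit it as a single
    // DMA packet, so the batch incurs only one ISR and one DPC.
    void SubmitBatched(CmDevice *device, CmQueue *queue,
                       CmKernel *producer, CmKernel *consumer,
                       CmThreadSpace *threadSpace) {
        CmTask *task = nullptr;
        device->CreateTask(task);   // create the command buffer (task)

        task->AddKernel(producer);  // block 418: add a workload
        task->AddSync();            // block 420: consumer reads producer's
                                    // output, so insert a synchronization command
        task->AddKernel(consumer);  // block 424: another workload to add,
                                    // so return to block 418

        CmEvent *done = nullptr;
        queue->Enqueue(task, done, threadSpace);  // block 426: one DMA packet
        done->WaitForTaskFinished();              // block 428: single completion
    }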
Table 1 below illustrates experimental results that were obtained after applying the disclosed batch submission mechanism to a perceptual computing application with synchronization.
As shown in Table 1, performance gains have been realized after applying the batch submission mechanism disclosed herein to process multiple synchronized GPU workloads in one DMA packet in a perceptual computing application. These results suggest that the GPU 160 is better utilized by the CPU 120 when the disclosed batch submission mechanism is used, which should lead to reductions in system power consumption. These results may be attributed to, among other things, the reduced number of ISRs and DPCs, as well as the smaller number of DMA packets needing to be scheduled.
Illustrative examples of the technologies disclosed herein are provided below. An embodiment of the technologies may include any one or more, and any combination of, the examples described below.
Example 1 includes a computing device for executing programmable workloads, the computing device comprising a central processing unit to create a direct memory access packet, the direct memory access packet comprising a separate dispatch instruction for each of the programmable workloads; a graphics processing unit to execute the programmable workloads, each of the programmable workloads comprising a set of graphics processing unit instructions; wherein each of the separate dispatch instructions in the direct memory access packet is to initiate processing by the graphics processing unit of one of the programmable workloads; and a direct memory access subsystem to communicate the direct memory access packet from memory accessible by the central processing unit to memory accessible by the graphics processing unit.
Example 2 includes the subject matter of Example 1, wherein the central processing unit is to create a command buffer comprising dispatch commands embodied in human-readable computer code, and the dispatch instructions in the direct memory access packet correspond to the dispatch commands in the command buffer.
Example 3 includes the subject matter of Example 2, wherein the central processing unit executes a user space driver to create the command buffer and the central processing unit executes a device driver to create the direct memory access packet.
Example 4 includes the subject matter of any of Examples 1-3, wherein the central processing unit is to create a first type of direct memory access packet for programmable workloads that have a dependency relationship and a second type of direct memory access packet for programmable workloads that do not have a dependency relationship, wherein the first type of direct memory access packet is different than the second type of direct memory access packet.
Example 5 includes the subject matter of Example 4, wherein the first type of direct memory access packet comprises a synchronization instruction between two of the dispatch instructions, and the second type of direct memory access packet does not comprise any synchronization instructions between the dispatch instructions.
Example 6 includes the subject matter of any of Examples 1-3, wherein each of the dispatch instructions in the direct memory access packet is to initiate processing of one of the programmable workloads by an execution unit of the graphics processing unit.
Example 7 includes the subject matter of any of Examples 1-3, wherein the direct memory access packet comprises a synchronization instruction to ensure that execution of one of the programmable workloads by the graphics processing unit finishes before the graphics processing unit begins execution of another of the programmable workloads.
Example 8 includes the subject matter of any of Examples 1-3, wherein each of the programmable workloads comprises instructions to execute a graphics processing unit task requested by a user space application.
Example 9 includes the subject matter of Example 8, wherein the user space application comprises a perceptual computing application.
Example 10 includes the subject matter of Example 8, wherein the graphics processing unit task comprises processing of a frame of a digital video.
Example 11 includes a computing device for submitting programmable workloads to a graphics processing unit, each of the programmable workloads comprising a set of graphics processing unit instructions, the computing device comprising: a graphics subsystem to facilitate communication between a user space application and the graphics processing unit; and a batch submission mechanism to create a single command buffer comprising separate dispatch commands for each of the programmable workloads, wherein each of the separate dispatch commands in the command buffer is to separately initiate processing by the graphics processing unit of one of the programmable workloads.
Example 12 includes the subject matter of Example 11, and comprises a device driver to create a direct memory access packet, the direct memory access packet comprising graphics processing unit instructions corresponding to the dispatch commands in the command buffer.
Example 13 includes the subject matter of Example 11 or Example 12, wherein the dispatch commands are to cause the graphics processing unit to execute all of the programmable workloads in parallel.
Example 14 includes the subject matter of Example 11 or Example 12, and comprises a synchronization mechanism to insert into the command buffer a synchronization command to cause the graphics processing unit to complete execution of a programmable workload before beginning the execution of another programmable workload.
Example 15 includes the subject matter of Example 14, wherein the synchronization mechanism is embodied as a component of the batch submission mechanism.
Example 16 includes the subject matter of any of Examples 11-13, wherein the batch submission mechanism is embodied as a component of the graphics subsystem.
Example 17 includes the subject matter of Example 16, wherein the graphics subsystem is embodied as one or more of: an application programming interface, a plurality of application programming interfaces, and a runtime library.
Example 18 includes a method for submitting programmable workloads to a graphics processing unit, the method comprising, with a computing device: creating a command buffer; adding a plurality of dispatch commands to the command buffer, each of the dispatch commands to initiate execution of one of the programmable workloads by a graphics processing unit of the computing device; and creating a direct memory access packet comprising graphics processing unit instructions corresponding to the dispatch commands in the command buffer.
Example 19 includes the subject matter of Example 18, and comprises communicating the direct memory access packet to memory accessible by the graphics processing unit.
Example 20 includes the subject matter of Example 18, and comprises inserting a synchronization command between two of the dispatch commands in the command buffer, wherein the synchronization command is to ensure that the graphics processing unit completes the processing of one of the programmable workloads before the graphics processing unit begins processing another of the programmable workloads.
Example 21 includes the subject matter of Example 18, and comprises formulating each of the dispatch commands to create a set of arguments for one of the programmable workloads.
Example 22 includes the subject matter of Example 18, and comprises formulating each of the dispatch commands to create a thread space for one of the programmable workloads.
Example 23 includes the subject matter of any of Examples 18-22, and comprises, by a direct memory access subsystem of the computing device, transferring the direct memory access packet from memory accessible by the central processing unit to memory accessible by the graphics processing unit.
Example 24 includes a computing device comprising the central processing unit, the graphics processing unit, and memory having stored therein a plurality of instructions that when executed by the central processing unit cause the computing device to perform the method of any of Examples 18-23.
Example 25 includes one or more machine readable storage media comprising a plurality of instructions stored thereon that in response to being executed result in a computing device performing the method of any of Examples 18-23.
Example 26 includes a computing device comprising means for performing the method of any of Examples 18-23.
Example 27 includes a method for executing programmable workloads, the method comprising, with a computing device: by a central processing unit of the computing device, creating a direct memory access packet, the direct memory access packet comprising a separate dispatch instruction for each of the programmable workloads; by a graphics processing unit of the computing device, executing the programmable workloads, each of the programmable workloads comprising a set of graphics processing unit instructions; wherein each of the separate dispatch instructions in the direct memory access packet is to initiate processing by the graphics processing unit of one of the programmable workloads; and by a direct memory access subsystem of the computing device, communicating the direct memory access packet from memory accessible by the central processing unit to memory accessible by the graphics processing unit.
Example 28 includes the subject matter of Example 27, and comprises, by the central processing unit, creating a command buffer comprising dispatch commands embodied in human-readable computer code, wherein the dispatch instructions in the direct memory access packet correspond to the dispatch commands in the command buffer.
Example 29 includes the subject matter of Example 28, and comprises, by the central processing unit, executing a user space driver to create the command buffer, wherein the central processing unit executes a device driver to create the direct memory access packet.
Example 30 includes the subject matter of any of Examples 27-29, and comprises, by the central processing unit, creating a first type of direct memory access packet for programmable workloads that have a dependency relationship and creating a second type of direct memory access packet for programmable workloads that do not have a dependency relationship, wherein the first type of direct memory access packet is different than the second type of direct memory access packet.
Example 31 includes the subject matter of Example 30, wherein the first type of direct memory access packet comprises a synchronization instruction between two of the dispatch instructions, and the second type of direct memory access packet does not comprise any synchronization instructions between the dispatch instructions.
Example 32 includes the subject matter of any of Examples 27-29, and comprises, by each of the dispatch instructions in the direct memory access packet, initiating processing of one of the programmable workloads by an execution unit of the graphics processing unit.
Example 33 includes the subject matter of any of Examples 27-29, and comprises, by a synchronization instruction in the direct memory access packet, ensuring that execution of one of the programmable workloads by the graphics processing unit finishes before the graphics processing unit begins execution of another of the programmable workloads.
Example 34 includes the subject matter of any of Examples 27-29, and comprises, by each of the programmable workloads, executing a graphics processing unit task requested by a user space application.
Example 35 includes the subject matter of Example 34, wherein the user space application comprises a perceptual computing application.
Example 36 includes the subject matter of Example 34, wherein the graphics processing unit task comprises processing of a frame of a digital video.
Example 37 includes a computing device comprising the central processing unit, the graphics processing unit, the direct memory access subsystem, and memory having stored therein a plurality of instructions that when executed by the central processing unit cause the computing device to perform the method of any of Examples 27-36.
Example 38 includes one or more machine readable storage media comprising a plurality of instructions stored thereon that in response to being executed result in a computing device performing the method of any of Examples 27-36.
Example 39 includes a computing device comprising means for performing the method of any of Examples 27-36.
Example 40 includes a method for submitting programmable workloads to a graphics processing unit of a computing device, each of the programmable workloads comprising a set of graphics processing unit instructions, the method comprising: by a graphics subsystem of the computing device, facilitating communication between a user space application and the graphics processing unit; and by a batch submission mechanism of the computing device, creating a single command buffer comprising separate dispatch commands for each of the programmable workloads, wherein each of the separate dispatch commands in the command buffer is to separately initiate processing by the graphics processing unit of one of the programmable workloads.
Example 41 includes the subject matter of Example 40, and comprises, by a device driver of the computing device, creating a direct memory access packet, wherein the direct memory access packet comprises graphics processing unit instructions corresponding to the dispatch commands in the command buffer.
Example 42 includes the subject matter of Example 40 or Example 41, and comprises, by the dispatch commands, causing the graphics processing unit to execute all of the programmable workloads in parallel.
Example 43 includes the subject matter of Example 40 or Example 41, and comprises, by a synchronization mechanism of the computing device, inserting into the command buffer a synchronization command to cause the graphics processing unit to complete execution of a programmable workload before the graphics processing unit begins the execution of another programmable workload.
Example 44 includes the subject matter of Example 43, wherein the synchronization mechanism is embodied as a component of the batch submission mechanism.
Example 45 includes the subject matter of any of Examples 40-44, wherein the batch submission mechanism is embodied as a component of the graphics subsystem.
Example 46 includes the subject matter of any of Examples 40-44, wherein the graphics subsystem is embodied as one or more of: an application programming interface, a plurality of application programming interfaces, and a runtime library.
Example 47 includes a computing device comprising the central processing unit, the graphics processing unit, the direct memory access subsystem, and memory having stored therein a plurality of instructions that when executed by the central processing unit cause the computing device to perform the method of any of Examples 40-46.
Example 48 includes one or more machine readable storage media comprising a plurality of instructions stored thereon that in response to being executed result in a computing device performing the method of any of Examples 40-46.
Example 49 includes a computing device comprising means for performing the method of any of Examples 40-46.