Computing techniques have been developed to allow general purpose operations to be performed on a GPU (Graphics Processing Unit). A GPU has a large number of simple parallel processing pipelines that are optimized for graphics processing. By moving general purpose operations that require many similar or identical parallel calculations to the GPU, these operations can be performed more quickly than on the CPU (Central Processing Unit) while processing demands on the CPU are reduced. This can reduce power consumption while improving performance.
However, the command buffers and command streamers of GPUs are not designed to optimize the transfer of intermediate values and commands between the CPU and the GPU. GPUs frequently use separate memory storage and cache resources that are isolated from the CPU. GPUs are also optimized for sending final results to frame buffers for rendering images rather than for sending results back to a CPU for further processing.
Intel® 3D (Three-Dimensional) or GPGPU (General Purpose Graphics Processing Unit) driver software dispatches workloads to GPU hardware in quanta of a command buffer by programming an MI_BATCH_BUFFER_START command in a ring buffer. In certain usage models, the driver processes the statistics output by a command buffer to evaluate its condition and then determines whether to dispatch or skip the subsequent dependent command buffers. This determination by the driver creates latency that degrades command performance, because control is transferred from the hardware of the command streamer and arithmetic logic unit to the software of the driver and back to hardware again.
The GPGPU driver waits for the previously dispatched command buffer to complete execution before it evaluates the condition from the statistics output by the completed command buffer. Based on the evaluated condition, the driver decides whether the subsequent dependent command buffer is to be executed or skipped.
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.
Embodiments of the present invention provide a mechanism in GPU hardware, such as a command streamer, to evaluate conditions held in predicate registers and to skip the subsequent dependent command buffers without software intervention. The mechanism can evaluate predicates on the fly, avoiding a transfer of control from hardware to software. A generalized and programmable hardware component assists the software in providing self-modifying command stream execution.
In one example, a “Predication Enable” control field is provided in a command executed before the start of the execution of a command sequence, such as a sequence loaded into a batch buffer. In the described example, this is referred to as a “MI_BATCH_BUFFER_START” command, where MI refers to memory interface. The control field, such as a flag, when parsed by a command streamer, indicates that the MI_BATCH_BUFFER_START command should be skipped based on a value in a predicate register. In the described example, this is referred to as a “PR_RESULT_1” register. In the described embodiment, there is also a “PR_RESULT_0” register that is used to predicate a 3DPRIMITIVE command. This command is used to trigger rendering in a 3D engine 216 of the GPU 201 shown in the accompanying drawings.
Command buffers, such as a particular MI_BATCH_BUFFER_START command, can be skipped conditionally depending upon a predicate register value, such as the PR_RESULT_1 value. A predication control field, such as a PREDICATION ENABLE field in the MI_BATCH_BUFFER_START command, indicates whether the hardware is to use predication to determine whether to skip the command. When predication is enabled, the hardware either skips or does not skip the batch buffer depending on the PR_RESULT_1 value. When predication is not enabled, the command is executed without reference to the predication register.
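As a minimal sketch of this behavior, assuming hypothetical structure and field names (the actual command encodings are not reproduced here), the command streamer's predication check might be modeled as follows:

```c
/* Minimal sketch of the predication check on MI_BATCH_BUFFER_START.
 * The structure layout and names are assumptions for illustration only. */
#include <stdbool.h>
#include <stdint.h>

struct batch_buffer_start {
    uint64_t batch_buffer_address;  /* pointer to the command buffer in memory */
    bool     predication_enable;    /* the PREDICATION ENABLE control field */
};

static void execute_batch_buffer(uint64_t address)
{
    (void)address;  /* fetch and execute the commands in the batch buffer (not modeled) */
}

/* pr_result_1 models the value of the PR_RESULT_1 predicate register. */
static void parse_batch_buffer_start(const struct batch_buffer_start *cmd,
                                     uint64_t pr_result_1)
{
    if (cmd->predication_enable && pr_result_1 == 0) {
        return;  /* predication enabled and predicate not set: skip the batch buffer */
    }
    execute_batch_buffer(cmd->batch_buffer_address);
}
```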
The PR_RESULT_1 value can be produced in any of a variety of different ways. In one example, it is the output of an MMIO (Memory-Mapped Input/Output) register. This MMIO register can be exercised just as any other GPU register. Any expression consisting of logical and arithmetic functions can be evaluated with the help of appropriate commands, such as an MI_MATH command, in the command streamer, and the result can subsequently be moved to the PR_RESULT_1 register. The MI_MATH command can be retrieved from a ring buffer or a command buffer to provide an ability to execute any logical or arithmetic expression. The logical and arithmetic expression can be executed using hardware logic in the command streamer based on ALU instructions delivered as payload in the MI_MATH command.
The embodiments described herein are described in the context of several specific commands and registers. These commands and registers are taken from the particular context of Intel® GPGPU; however, different commands and registers may be used instead of those named herein. These different commands and registers may be taken from GPGPU or from another context for executing commands through a command streamer and an arithmetic logic unit.
Start command:
The MI_BATCH_BUFFER_START command is used to initiate the execution of commands stored in a batch buffer. The command indicates the batch buffer at which execution is to start and provides values for the needed state registers including memory state selections. It can be executed directly from a ring buffer. The execution can be stopped either with an end command or with a new start command that points to a different batch buffer. In GPGPU contexts, this command can be an existing command.
Predication:
A new PREDICATION ENABLE control field can be added to the MI_BATCH_BUFFER_START command or a similar command. The MI_BATCH_BUFFER_START command has a pointer to the command buffer in memory which needs to be fetched and executed. This buffer will indicate the condition to apply to the predication register. If the PR_RESULT_1 value is not set, then the command streamer, on parsing the MI_BATCH_BUFFER_START command with the Predication Enable field set, skips the command. In other words, it does not execute the commands in the buffer pointed to by the MI_BATCH_BUFFER_START command. If the PR_RESULT_1 value is set, then the command streamer executes the command, resulting in execution of the commands in the buffer that the MI_BATCH_BUFFER_START command points to.
Software can be used to program all of the command buffers and all of the dependent command buffers in the ring buffer as a single dispatch. The predication enable field can be set for the command buffers that need to be predicated in this same dispatch. This can easily be done by including the predication enable field in the command to which the field applies. However, predication can also be enabled in other ways. Software computes the PR_RESULT_1 value by programming an MI_MATH command in the ring buffer before using the PR_RESULT_1 register for predication of the subsequent command buffers. Software can then reprogram the MI_MATH command whenever the PR_RESULT_1 value has to be recomputed.
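A hedged sketch of how software might assemble such a single dispatch in a ring buffer is given below; the opcode values, flag bits, and register offsets are assumptions made for illustration, not the actual command encodings.

```c
/* Illustrative single dispatch: a producer command buffer, an MI_MATH
 * computation of the predicate, and a predicated dependent command buffer.
 * The opcode values, flag bits, and register offsets are assumptions made
 * for this sketch, not the actual hardware encodings. */
#include <stddef.h>
#include <stdint.h>

#define MI_BATCH_BUFFER_START_HDR 0x31000000u  /* assumed encoding */
#define MI_MATH_HDR               0x1a000000u  /* assumed encoding */
#define MI_LOAD_REGISTER_REG_HDR  0x29000000u  /* assumed encoding */
#define PREDICATION_ENABLE_BIT    (1u << 15)   /* assumed flag position */
#define CS_GPR0_OFFSET            0x600u       /* assumed MMIO offsets */
#define PR_RESULT_1_OFFSET        0x41cu

struct ring_buffer {
    uint32_t dwords[1024];
    size_t   count;
};

static void emit(struct ring_buffer *rb, uint32_t dword)
{
    rb->dwords[rb->count++] = dword;   /* no bounds checking in this sketch */
}

static void build_dispatch(struct ring_buffer *rb,
                           const uint32_t *alu_instructions, size_t n_alu,
                           uint32_t producer_bb, uint32_t dependent_bb)
{
    /* Producer command buffer that outputs the statistics. */
    emit(rb, MI_BATCH_BUFFER_START_HDR);
    emit(rb, producer_bb);

    /* MI_MATH: the ALU instructions form the payload; the command's dword
     * length depends on how many instructions are packed into it. */
    emit(rb, MI_MATH_HDR | (uint32_t)n_alu);
    for (size_t i = 0; i < n_alu; i++)
        emit(rb, alu_instructions[i]);

    /* Move the computed result from a general purpose register into
     * PR_RESULT_1 for use as the predicate. */
    emit(rb, MI_LOAD_REGISTER_REG_HDR);
    emit(rb, CS_GPR0_OFFSET);
    emit(rb, PR_RESULT_1_OFFSET);

    /* Dependent command buffer, skipped by hardware when predication is
     * enabled and PR_RESULT_1 is not set. */
    emit(rb, MI_BATCH_BUFFER_START_HDR | PREDICATION_ENABLE_BIT);
    emit(rb, dependent_bb);
}
```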
Math Command:
A math command can be used to carry ALU (Arithmetic Logic Unit) instructions as a payload to be executed on an ALU. A graphics command streamer, on parsing the math command, outputs a new set of ALU instructions to the ALU block on every clock. If the ALU takes a single clock to process any given ALU instruction, then one instruction can be provided with each clock. Software can load the appropriate general purpose registers (GPRs) with appropriate values before programming the command streamer with the math command.
In the described example, the math command is referred to as an MI_MATH command. However, any similar command can be used instead. The MI_MATH command allows software to send instructions to an ALU in a render command streamer. The MI_MATH command is a means by which the ALU can be accessed by the CPU to perform general purpose operations, that is, operations that are not a part of graphics rendering. The MI_MATH command contains headers and a payload. The instructions for the ALU can form the data payload.
In some embodiments of the invention, ALU instructions are all a dword (double word) in size. The MI_MATH dword length can be programmed based on the number of ALU instructions packed into the payload so that the maximum number of instructions is limited by the maximum dword length supported. When the MI_MATH command is parsed by a command streamer, the command streamer outputs the payload dwords (ALU instructions) to the ALU.
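The parsing step can be sketched as follows; this is a simplified model assuming each payload dword is a complete ALU instruction forwarded on one clock, with illustrative function names.

```c
/* Sketch of a command streamer parsing an MI_MATH command and forwarding
 * the payload dwords (ALU instructions) to the ALU, one per clock when each
 * instruction takes a single clock to process. Names are illustrative. */
#include <stddef.h>
#include <stdint.h>

static void alu_execute(uint32_t instruction)
{
    (void)instruction;  /* decode and execute the ALU instruction (not modeled here) */
}

static void parse_mi_math(const uint32_t *payload, size_t dword_length)
{
    /* The MI_MATH dword length is programmed from the number of ALU
     * instructions packed into the payload. */
    for (size_t i = 0; i < dword_length; i++)
        alu_execute(payload[i]);  /* one instruction output to the ALU per clock */
}
```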
The MI_MATH command, along with register and memory modifying commands in the command streamer, provides a powerful construct for 3D and GPGPU drivers in building self-modifying command buffers (computing and storing thread or vertex counts, etc.). There are many applications for self-modifying command buffers. This combination also provides an ability for software to carry out generic computations in the front end of the 3D or GPGPU pipe without back and forth transfers between the GPU and the CPU for the same command stream.
Table 1 shows example parameters for the MI_MATH command. This command or a similar command can be used to load up the batch buffer of the ALU with instructions. The package of instructions may take several clock cycles to complete even with parallel processing.
Programming Sequence
The operations described above can be represented as a programming sequence or a flow chart as described below. The example programming sequence below uses the commands described above and annotates them with comments.
Flow Sequence
The operations of the programming sequence above can be represented as flow charts as shown in the process flow diagrams of the accompanying drawings.
If predication is not enabled, then the process skips to block 16 to execute the batch buffer. After the execution, the process returns to receive a new batch buffer command. On the other hand, if predication is enabled, then the process checks the predication condition at block 14. In one example, the predication condition is a set/not-set condition: if the value in the predication register is set, then the command is executed; if the value is not set, then the command is not executed. In another embodiment, the predication condition requires that an operation be performed against the predication register, such as a greater-than, less-than, or equal-to comparison. If the condition is met, then the command is executed. If the condition is not met, then the command is not executed.
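A small sketch of the check at block 14 is shown below, covering both the set/not-set test and the comparison variant; the enumeration and function names are assumptions for illustration.

```c
/* Sketch of the predication condition check at block 14: either a simple
 * set/not-set test or, in another embodiment, a comparison against the
 * predicate register. The enum and names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

enum predicate_op { PRED_SET, PRED_EQ, PRED_GT, PRED_LT };

static bool predication_condition_met(enum predicate_op op,
                                      uint64_t predicate_reg,
                                      uint64_t reference)
{
    switch (op) {
    case PRED_SET: return predicate_reg != 0;           /* set / not set */
    case PRED_EQ:  return predicate_reg == reference;   /* equal to */
    case PRED_GT:  return predicate_reg >  reference;   /* greater than */
    case PRED_LT:  return predicate_reg <  reference;   /* less than */
    }
    return false;
}
```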
In the context of
The process flow of
At block 26, it is determined whether the results of the command execution are available in a general purpose register. This register may be identified, for example, in the MI_MATH command mentioned above, in another execution command, or in the start command. If the results are not available, then the process ends and returns to load new values into the buffer registers. If the results are available, then the available results are loaded into the predicate register at block 28. The process then returns to load the buffer with the new batch of register values at block 20.
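Blocks 26 and 28 can be summarized in a short sketch; the structure and field names below are assumptions for illustration.

```c
/* Minimal sketch of blocks 26-28: if the command's results are available in
 * a general purpose register, move them into the predicate register. */
#include <stdbool.h>
#include <stdint.h>

struct predicate_state {
    bool     result_available;   /* block 26: results present in a GPR? */
    uint64_t gpr_value;          /* value produced by the executed commands */
    uint64_t predicate_register; /* e.g. PR_RESULT_1 */
};

static void update_predicate(struct predicate_state *s)
{
    if (s->result_available) {
        s->predicate_register = s->gpr_value;  /* block 28: load the predicate */
    }
    /* In either case the flow returns to load the next batch of register
     * values (block 20). */
}
```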
The operations of
Arithmetic Logic Unit:
Referring to the accompanying drawings, an ALU (Arithmetic Logic Unit) 111 in a graphics hardware command streamer 101 is used. The ALU can be exercised by software using, for example, the MI_MATH command described above. The output 113 of the ALU can be stored in any MMIO register which can be read from or written to by hardware or software, such as REG0 to REG15. After executing the MI_MATH command, the contents of any MMIO register can be moved to any other MMIO register or to a location in GPU memory (not shown).
The ALU supports arithmetic operations (addition and subtraction) and logical operations (AND, OR, XOR) on two 64-bit operands. The ALU has two 64-bit registers at source input A (SRCA) and source input B (SRCB) to which the operands are loaded. The ALU performs operations on the contents of the SRCA and SRCB registers based on the ALU instructions supplied at 117, and the output is sent to a 64-bit Accumulator 119. A zero flag (ZF) and a carry flag (CF) 121 reflect the accumulator status after each operation. The command streamer 101 implements sixteen 64-bit general purpose registers REG0 to REG15 which are MMIO mapped. These registers can be accessed like any other GPU MMIO mapped register. Any selected GPR register can be moved to the SRCA or SRCB register using a “LOAD” instruction 123. Outputs of the ALU (Accumulator, ZF and CF) can be moved to any of the GPRs using a “STORE” instruction 125. Any of the GPRs can be moved to any of the other GPU registers using existing techniques, for example, an MI_LOAD_REGISTER_REG GPU instruction. GPR values can be moved to any memory location using existing techniques, for example, an MI_LOAD_REGISTER_MEM command. This gives complete flexibility for the use of the output of the ALU.
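A simplified software model of this datapath is sketched below. Only the register structure and operations described above are taken from the text; the exact carry and zero flag semantics, particularly for the logical operations, are assumptions.

```c
/* Simplified software model of the command streamer ALU datapath: sixteen
 * GPRs, SRCA/SRCB source registers, a 64-bit accumulator, and ZF/CF flags.
 * Flag behavior for logical operations is an assumption of this sketch. */
#include <stdint.h>

struct cs_alu {
    uint64_t gpr[16];      /* REG0..REG15, MMIO-mapped general purpose registers */
    uint64_t srca, srcb;   /* source operand registers */
    uint64_t accu;         /* 64-bit accumulator */
    uint64_t zf, cf;       /* zero flag and carry flag */
};

/* LOAD: move a selected GPR into SRCA or SRCB. */
static void alu_load_srca(struct cs_alu *a, unsigned reg) { a->srca = a->gpr[reg]; }
static void alu_load_srcb(struct cs_alu *a, unsigned reg) { a->srcb = a->gpr[reg]; }

/* Arithmetic operations on SRCA and SRCB; results go to the accumulator. */
static void alu_add(struct cs_alu *a)
{
    a->accu = a->srca + a->srcb;
    a->cf   = (a->accu < a->srca);   /* unsigned carry out */
    a->zf   = (a->accu == 0);
}
static void alu_sub(struct cs_alu *a)
{
    a->accu = a->srca - a->srcb;
    a->cf   = (a->srca < a->srcb);   /* borrow */
    a->zf   = (a->accu == 0);
}

/* Logical operations; carry flag behavior here is assumed. */
static void alu_and(struct cs_alu *a) { a->accu = a->srca & a->srcb; a->cf = 0; a->zf = (a->accu == 0); }
static void alu_or (struct cs_alu *a) { a->accu = a->srca | a->srcb; a->cf = 0; a->zf = (a->accu == 0); }
static void alu_xor(struct cs_alu *a) { a->accu = a->srca ^ a->srcb; a->cf = 0; a->zf = (a->accu == 0); }

/* STORE: move the accumulator or a flag to a selected GPR. */
static void alu_store_accu(struct cs_alu *a, unsigned reg) { a->gpr[reg] = a->accu; }
static void alu_store_zf  (struct cs_alu *a, unsigned reg) { a->gpr[reg] = a->zf; }
static void alu_store_cf  (struct cs_alu *a, unsigned reg) { a->gpr[reg] = a->cf; }
```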
Table 2 shows an example set of commands that can be programmed into the ALU using, for example, the MI_MATH command. In Table 2, the operation code, bits 20-31, indicates the operation or function to be performed, while bits 0-19 hold the operands upon which the operation code operates. The commands use 32 bits, as identified in Table 2.
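Based only on the bit fields given above (operation code in bits 20-31, operands in bits 0-19), helpers for packing and unpacking these 32-bit instructions might look as follows; the specific opcode values of Table 2 are not reproduced here.

```c
/* Sketch of the 32-bit ALU instruction layout: the operation code occupies
 * bits 20-31 and the operands occupy bits 0-19. */
#include <stdint.h>

static uint32_t alu_instr_pack(uint32_t opcode, uint32_t operands)
{
    return ((opcode & 0xFFFu) << 20) | (operands & 0xFFFFFu);
}

static uint32_t alu_instr_opcode(uint32_t instr)   { return instr >> 20; }       /* bits 20-31 */
static uint32_t alu_instr_operands(uint32_t instr) { return instr & 0xFFFFFu; }  /* bits 0-19 */
```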
Referring further to
Each of the registers can be connected to either of two multiplexers (muxes) 131-1, 131-2. The multiplexers determine which values are applied to the source registers 133-1, 133-2, which supply the values SRCA and SRCB, as described above. The load command 123 is applied to these two muxes to load values into the SRCA and the SRCB registers. Using this architecture, any of the values in any of the general purpose registers can be applied as the inputs to the ALU 111. As each clock pulse is applied, different combinations of store, load, and ALU operations can be applied to the system to create different arithmetic and logical functions.
The ALU architecture of
The graphics subsystem 221 contains a unified return buffer 223 coupled to an array of graphics processing cores 225. The unified return buffer contains memory that is shared by various functions to allow threads to return data that will later be consumed by other functions or threads. The array of cores 225 processes the values from the pipeline streamers to eventually produce destination surfaces 227. The array of cores has access to sampler functions 229, math functions 231, inter-thread communications 233, color calculators 235, and a render cache 237 to cache finally rendered surfaces. A set of source surfaces 239 is applied to the graphics subsystem 221 and, after all of these functions 229, 231, 235, 237, 239 are applied by the array of cores, a set of destination surfaces 227 is produced. For purposes of general purpose calculations, the command streamer 211 and ALU are used to run operations through only the ALU or also through the array of cores 225, depending on the particular implementation.
Referring to
The input/output controller hub 505 includes connections to mass storage 531, external peripheral devices 533, and user input/output devices 535, such as a keyboard and mouse. The input/output controller hub may also include a display interface 537 and other additional interfaces. The display interface 537 is within a video processing subsystem 539. The subsystem may optionally be coupled through a display link 541 to the graphics core of the CPU.
Embodiments of the present invention provide a mechanism in a command streamer to skip any command buffer depending on a value set in a register. In the described example, this is the PR_RESULT_1 register, used when the predicate enable bit is set; however, the invention is not so limited. This provides a hardware mechanism in a command streamer, a hardware structure, to perform arithmetic and logical operations by means of a command, here the MI_MATH command, programmed in the command buffer or ring buffer. The output of the computed expression can be stored to any MMIO register. This enables a driver to evaluate any arbitrary condition involving arithmetic and logical expressions in hardware on the fly by programming MI_MATH appropriately in the command buffer or ring buffer. The evaluated output of the computed result may be moved to the PR_RESULT_1 register. If a predicate enable bit is set, then the evaluated output may be used to predicate the subsequent command buffers.
3D and GPGPU drivers may use embodiments of the present invention to accelerate the rate at which command buffers can be dispatched to a GPU by avoiding the long bubbles in hardware between consecutive dispatches. Avoiding these delays results in a performance boost. In addition, CPU power can be saved when running a 3D or GPGPU driver because the CPU is used more efficiently.
A wide range of additional and alternative devices may be coupled to the computer system 501 shown in
It is to be appreciated that a lesser or more equipped system than the examples described above may be preferred for certain implementations. Therefore, the configuration of the exemplary systems and circuits may vary from implementation to implementation depending upon numerous factors, such as price constraints, performance requirements, technological improvements, or other circumstances.
Embodiments may be implemented as any or a combination of: one or more microchips or integrated circuits interconnected using a motherboard, hardwired logic, software stored by a memory device and executed by a microprocessor, firmware, an application specific integrated circuit (ASIC), and/or a field programmable gate array (FPGA). The term “logic” may include, by way of example, software or hardware and/or combinations of software and hardware.
References to “one embodiment”, “an embodiment”, “example embodiment”, “various embodiments”, etc., indicate that the embodiment(s) of the invention so described may include particular features, structures, or characteristics, but not every embodiment necessarily includes the particular features, structures, or characteristics. Further, some embodiments may have some, all, or none of the features described for other embodiments.
In the following description and claims, the term “coupled” along with its derivatives, may be used. “Coupled” is used to indicate that two or more elements co-operate or interact with each other, but they may or may not have intervening physical or electrical components between them.
As used in the claims, unless otherwise specified, the use of the ordinal adjectives “first”, “second”, “third”, etc., to describe a common element, merely indicate that different instances of like elements are being referred to, and are not intended to imply that the elements so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown;
nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Number | Date | Country | Kind
---|---|---|---
3862/DEL/2011 | Dec 2011 | IN | national

Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/US12/70395 | 12/18/2012 | WO | 00 | 12/20/2013