A graphics processing unit (GPU) is a complex integrated circuit that is configured to perform graphics-processing tasks. For example, a GPU can execute graphics-processing tasks required by an end-user application, such as a video-game application. GPUs are also increasingly being used to perform other tasks which are unrelated to graphics. The GPU can be a discrete device or can be included in the same device as another processor, such as a central processing unit (CPU).
In many applications, such as graphics processing in a GPU, a sequence of work-items, which can also be referred to as threads, is processed so as to output a final result. In many modern parallel processors, for example, processors within a single instruction multiple data (SIMD) core synchronously execute a set of work-items. A plurality of identical synchronous work-items that are processed by separate processors are referred to as a wavefront or warp.
During processing, one or more SIMD cores concurrently execute multiple wavefronts. Execution of the wavefront terminates when all work-items within the wavefront complete processing. Each wavefront includes multiple work-items that are processed in parallel, using the same set of instructions. In some cases, the number of work-items in a wavefront does not match the number of execution units of the SIMD cores. In one embodiment, each execution unit of a SIMD core is an arithmetic logic unit (ALU). When the number of work-items in a wavefront does not match the number of execution units of the SIMD cores, determining how to schedule instructions for execution can be challenging.
The advantages of the methods and mechanisms described herein may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:
In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various embodiments may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
Various systems, apparatuses, methods, and computer-readable media for processing variable wavefront sizes on a processor are disclosed. When operating in a first mode, the processor executes the same instruction on multiple portions of a wavefront before proceeding to the next instruction of the shader program. When operating in a second mode, the processor executes a set of instructions on a first portion of a wavefront and when the processor finishes executing the set of instructions on the first portion of the wavefront, the processor executes the set of instructions on a second portion of the wavefront, and so on until all portions of the wavefront have been processed. Then, the processor continues executing subsequent instructions of the shader program.
In one embodiment, an indication is declared within the code sequence, with the indication specifying which mode to utilize for a given region of the program. In another embodiment, a compiler generates the indication when generating executable code, with the indication specifying the processor operating mode. In another embodiment, the processor includes a control unit which determines the processor operating mode.
Referring now to
In one embodiment, processing units 115A-N are configured to execute instructions of a particular instruction set architecture (ISA). Each processing unit 115A-N includes one or more execution units, cache memories, schedulers, branch prediction circuits, and so forth. In one embodiment, the processing units 115A-N are configured to execute the main control software of system 100, such as an operating system. Generally, software executed by processing units 115A-N during use can control the other components of system 100 to realize the desired functionality of system 100. Processing units 115A-N can also execute other software, such as application programs.
GPU 130 includes at least compute units 145A-N which are representative of any number and type of compute units that are used for graphics or general-purpose processing. Compute units 145A-N can also be referred to as “shader arrays”, “shader engines”, “single instruction multiple data (SIMD) units”, or “SIMD cores”. Each compute unit 145A-N includes a plurality of execution units. GPU 130 is coupled to shared caches 120A-B and fabric 125. In one embodiment, GPU 130 is configured to execute graphics pipeline operations such as draw commands, pixel operations, geometric computations, and other operations for rendering an image to a display. In another embodiment, GPU 130 is configured to execute operations unrelated to graphics. In a further embodiment, GPU 130 is configured to execute both graphics operations and non-graphics related operations.
GPU 130 is configured to receive instructions of a shader program and wavefronts for execution. In one embodiment, GPU 130 is configured to operate in different modes. In one embodiment, the number of work-items in each wavefront is greater than the number of execution units in GPU 130.
In one embodiment, GPU 130 schedules a first instruction for execution on first and second portions of a first wavefront prior to scheduling a second instruction for execution on the first portion of the first wavefront responsive to detecting a first indication. GPU 130 follows this pattern for the other instructions of the shader program and for other wavefronts as long as the first indication is detected. It is noted that “scheduling an instruction” can also be referred to as “issuing an instruction”. Depending on the embodiment, the first indication can be specified in software, or the first indication can be generated by GPU 130 based on one or more operating conditions. In one embodiment, the first indication is a command for GPU 130 to operate in a first mode.
In one embodiment, GPU 130 schedules the first instruction and the second instruction for execution on the first portion of the first wavefront prior to scheduling the first instruction for execution on the second portion of the first wavefront responsive to not detecting the first indication. GPU 130 follows this pattern for the other instructions of the shader program and for other wavefronts as long as the first indication is not detected.
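The two scheduling patterns described above can be sketched as a pair of nested loops whose nesting order depends on whether the first indication is detected. This is a minimal illustration only; the function name `issue_order` and the string labels for instructions are hypothetical and are not part of the described hardware.

```python
from typing import List, Tuple


def issue_order(instructions: List[str], num_portions: int,
                first_indication: bool) -> List[Tuple[str, int]]:
    """Return (instruction, portion) pairs in the order they would be issued."""
    if first_indication:
        # First indication detected: each instruction is issued on every
        # portion of the wavefront before the next instruction is issued.
        return [(inst, p) for inst in instructions for p in range(num_portions)]
    # First indication not detected: all instructions are issued on one
    # portion before any instruction is issued on the next portion.
    return [(inst, p) for p in range(num_portions) for inst in instructions]


with_indication = issue_order(["I0", "I1"], 2, first_indication=True)
without_indication = issue_order(["I0", "I1"], 2, first_indication=False)
print(with_indication)     # [('I0', 0), ('I0', 1), ('I1', 0), ('I1', 1)]
print(without_indication)  # [('I0', 0), ('I1', 0), ('I0', 1), ('I1', 1)]
```

Both orders issue the same set of (instruction, portion) pairs; only the sequence differs.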
I/O interfaces 110 are coupled to fabric 125, and I/O interfaces 110 are representative of any number and type of interfaces (e.g., peripheral component interconnect (PCI) bus, PCI-Extended (PCI-X), PCIE (PCI Express) bus, gigabit Ethernet (GBE) bus, universal serial bus (USB)). Various types of peripheral devices can be coupled to I/O interfaces 110. Such peripheral devices include (but are not limited to) displays, keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, network interface cards, and so forth.
SoC 105 is coupled to memory 150, which includes one or more memory modules. Each of the memory modules includes one or more memory devices mounted thereon. In some embodiments, memory 150 includes one or more memory devices mounted on a motherboard or other carrier upon which SoC 105 is also mounted. The RAM implemented can be static RAM (SRAM), dynamic RAM (DRAM), Resistive RAM (ReRAM), Phase Change RAM (PCRAM), or any other volatile or non-volatile RAM. The type of DRAM that is used to implement memory 150 includes (but is not limited to) double data rate (DDR) DRAM, DDR2 DRAM, DDR3 DRAM, and so forth. Although not explicitly shown in
In various embodiments, computing system 100 can be a computer, laptop, mobile device, server or any of various other types of computing systems or devices. It is noted that the number of components of computing system 100 and/or SoC 105 can vary from embodiment to embodiment. There can be more or fewer of each component/subcomponent than the number shown in
Turning now to
In one embodiment, GPU 200 is configured to operate in different modes to process instructions of a shader program on wavefronts of different sizes. GPU 200 utilizes a given mode to optimize performance, power consumption, and/or other factors depending on the type of workload being processed and/or the number of work-items in each wavefront. In one embodiment, each wavefront includes a number of work-items which is greater than the number of lanes 215A-N, 220A-N, and 225A-N in SIMDs 210A-N. In this embodiment, GPU 200 processes the wavefronts differently based on the operating mode of GPU 200. In another embodiment, GPU 200 processes the wavefronts differently based on one or more detected conditions. Each lane 215A-N, 220A-N, and 225A-N of SIMDs 210A-N can also be referred to as an “execution unit”.
In one embodiment, GPU 200 receives a plurality of instructions for a wavefront with a number of work-items which is greater than the total number of lanes in SIMDs 210A-N. In this embodiment, GPU 200 executes a first instruction on multiple portions of a wavefront before proceeding to the second instruction when GPU 200 is in a first mode. GPU 200 continues with this pattern of execution for subsequent instructions for as long as GPU 200 is in the first mode. In one embodiment, the first mode can be specified by a software-generated declaration. If GPU 200 is in a second mode, then GPU 200 executes multiple instructions on a first portion of the wavefront before proceeding to the second portion of the wavefront. When GPU 200 is in the second mode, GPU 200 shares a portion of vector general purpose registers (VGPRs) 230A-N between different portions of the wavefront. Additionally, when GPU 200 is in the second mode, if an execution mask of mask(s) 250 indicates that a given portion of the wavefront is temporarily masked out, then GPU 200 does not execute the instruction for the given portion of the wavefront.
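The interaction between the second mode and the execution mask can be illustrated with a short sketch. It assumes that a portion counts as "masked out" when none of its work-items is active, and it models the mask as a list of 0/1 flags; the function name `second_mode_schedule` and the 4-lane example are hypothetical choices made for brevity.

```python
def second_mode_schedule(instructions, exec_mask, lanes):
    """Return (instruction, portion) pairs, skipping fully masked-out portions."""
    num_portions = len(exec_mask) // lanes
    schedule = []
    for portion in range(num_portions):
        portion_mask = exec_mask[portion * lanes:(portion + 1) * lanes]
        if not any(portion_mask):
            # No active work-item in this portion, so no instruction is
            # executed for it (assumed meaning of "masked out").
            continue
        for inst in instructions:
            schedule.append((inst, portion))
    return schedule


# Eight work-items on a hypothetical 4-lane unit; the second portion is masked out.
mask = [1, 0, 1, 1, 0, 0, 0, 0]
print(second_mode_schedule(["I0", "I1"], mask, lanes=4))
# [('I0', 0), ('I1', 0)]
```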
In another embodiment, if the wavefront size is greater than the number of SIMDs, GPU 200 determines the cache miss rate of cache 260 for the program. If the cache miss rate is less than a threshold, then GPU 200 executes a first instruction on multiple portions of a wavefront before proceeding to the second instruction. GPU 200 continues with this pattern of execution for subsequent instructions for as long as the cache miss rate is determined or predicted to be less than the threshold. The threshold can be specified as a number of bytes, as a percentage of cache 260, or as any other suitable metric. If the cache miss rate is greater than or equal to the threshold, then GPU 200 executes multiple instructions on a first portion of the wavefront before executing the multiple instructions on the second portion of the wavefront. Additionally, if the cache miss rate is greater than or equal to the threshold, GPU 200 shares a portion of vector general purpose registers (VGPRs) 230A-N between different portions of the wavefront, and GPU 200 skips instructions for the given portion of the wavefront if the execution mask indicates the given portion is masked out.
It is noted that the letter “N” when displayed herein next to various structures is meant to generically indicate any number of elements for that structure (e.g., any number of SIMDs 210A-N). Additionally, different references within
Referring now to
As shown in
In one embodiment, if the host GPU (e.g., GPU 200) is in a first mode, then shared VGPRs 315 and shared VGPRs 310 are not shared between different portions of the first and second wavefronts, respectively. However, when the host GPU is in a second mode, then shared VGPRs 315 and shared VGPRs 310 are shared between different portions of the first and second wavefronts, respectively. In other embodiments, the implementation of sharing or not sharing is based on the detection of a first indication rather than being based on a first mode or second mode. The first indication can be generated by software, generated based on cache miss rate, or generated based on one or more other operating conditions.
Turning now to
In one embodiment, N is 32, and the number of work-items per wavefront is 64. In other embodiments, N can be other values. In the embodiment when N is 32, vector unit 420 also includes 32 lanes which are shown as lanes 425A-N. In other embodiments, vector unit 420 can include other numbers of lanes.
Instruction sequence 410 is illustrative of one example of an instruction sequence. As shown in
Referring now to
In a first mode of operation, each instruction is executed on different subsets of the wavefront before the next instruction is executed on the different subsets. For example, instruction 415A is executed for the first half (i.e., work-items W0 through WN-1) of the wavefront during a first instruction cycle on lanes 425A-N of vector unit 420, and then instruction 415A is executed for the second half (i.e., work-items WN through W2N-1) of the first wavefront during a second instruction cycle on lanes 425A-N. For example, during the first instruction cycle, work-item W0 can execute on lane 425A, work-item W1 can execute on lane 425B, and so on.
Then, instruction 415B is executed for the first half of the wavefront during a third instruction cycle on lanes 425A-N, and then instruction 415B is executed for the second half of the wavefront during a fourth instruction cycle on lanes 425A-N. Next, instruction 415C is executed for the first half of the wavefront during a fifth instruction cycle on lanes 425A-N, and then instruction 415C is executed for the second half of the wavefront during a sixth instruction cycle on lanes 425A-N. Then, instruction 415D is executed for the first half of the wavefront during a seventh instruction cycle on lanes 425A-N, and then instruction 415D is executed for the second half of the wavefront during an eighth instruction cycle on lanes 425A-N. For the purposes of this discussion, it can be assumed that the second instruction cycle follows the first instruction cycle, the third instruction cycle follows the second instruction cycle, and so on.
Turning now to
For example, in a first instruction cycle, instruction 415A is executed for the first half (i.e., work-items W0 through WN-1) of the wavefront on lanes 425A-N of vector unit 420. Then, in a second instruction cycle, instruction 415B is executed for the first half of the wavefront on lanes 425A-N. Next, in a third instruction cycle, instruction 415C is executed for the first half of the wavefront on lanes 425A-N. Then, in a fourth instruction cycle, instruction 415D is executed for the first half of the wavefront on lanes 425A-N.
Next, instruction sequence 410 is executed on the second half of the wavefront. Accordingly, in a fifth instruction cycle, instruction 415A is executed for the second half (i.e., work-items WN through W2N-1) of the wavefront on lanes 425A-N of vector unit 420. Then, in a sixth instruction cycle, instruction 415B is executed for the second half of the wavefront on lanes 425A-N. Next, in a seventh instruction cycle, instruction 415C is executed for the second half of the wavefront on lanes 425A-N. Then, in an eighth instruction cycle, instruction 415D is executed for the second half of the wavefront on lanes 425A-N.
In another embodiment, if a wavefront had 4*N work-items, instruction sequence 410 could be executed on the first quarter of the wavefront, then instruction sequence 410 could be executed on the second quarter of the wavefront, followed by the third quarter and then the fourth quarter of the wavefront. Other wavefronts of other sizes and/or vector units with other numbers of lanes could be utilized in a similar manner for the second mode of operation.
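The cycle numbering used in the examples above generalizes to any number of portions. The sketch below gives closed-form cycle numbers for both modes, together with one plausible work-item-to-lane mapping (work-item W0 on the first lane, W1 on the second, wrapping at N). The helper names are hypothetical, and the one-cycle-per-issue timing is the same simplification the discussion above makes.

```python
def place_work_item(w, lanes):
    """Map work-item index w to (portion, lane) on a unit with `lanes` lanes."""
    return divmod(w, lanes)


def cycle_first_mode(inst_idx, portion, num_portions):
    """1-based cycle in which instruction inst_idx runs on `portion` (first mode)."""
    return inst_idx * num_portions + portion + 1


def cycle_second_mode(inst_idx, portion, num_instructions):
    """1-based cycle in which instruction inst_idx runs on `portion` (second mode)."""
    return portion * num_instructions + inst_idx + 1


N = 32
# Two halves, four instructions (indices 0..3 stand for 415A..415D).
print(cycle_first_mode(3, 1, num_portions=2))       # 8: 415D on the second half
print(cycle_second_mode(0, 1, num_instructions=4))  # 5: 415A on the second half
# A 4*N wavefront in the second mode: the fourth quarter starts at cycle 13.
print(cycle_second_mode(0, 3, num_instructions=4))  # 13
print(place_work_item(4 * N - 1, N))                # (3, 31)
```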
Referring now to
A processor receives a wavefront and a plurality of instructions of a shader program for execution (block 705). In one embodiment, the processor includes at least a plurality of execution units, a scheduler, a cache, and a plurality of GPRs. In one embodiment, the processor is a GPU. In other embodiments, the processor is any of various other types of processors (e.g., DSP, FPGA, ASIC, multi-core processor). In one embodiment, the number of work-items in the wavefront is greater than the number of execution units of the processor. For example, in one embodiment, the wavefront includes 64 work-items and the processor includes 32 execution units. In this embodiment, the number of work-items in the wavefront is equal to twice the number of execution units. In other embodiments, the wavefront can include other numbers of work-items and/or the processor can include other numbers of execution units. In some cases, the processor receives a plurality of wavefronts for execution. In these cases, method 700 can be implemented multiple times for the multiple wavefronts.
Next, the processor determines if a first indication has been detected (conditional block 710). In one embodiment, the first indication is a setting or parameter declared within a software instruction, with the setting or parameter specifying the operating mode for the processor to utilize. In another embodiment, the first indication is generated based on a cache miss rate of the wavefront. In other embodiments, other types of indications are possible and are contemplated.
If the first indication is detected (conditional block 710, “yes” leg), then the processor schedules the plurality of execution units to execute a first instruction on first and second portions of a wavefront prior to scheduling the plurality of execution units to execute a second instruction on the first portion of the wavefront (block 715). The processor can follow this same pattern of scheduling instructions for the remainder of the plurality of instructions, as long as the first indication is detected. If the first indication is not detected (conditional block 710, “no” leg), then the processor schedules the plurality of execution units to execute the first instruction and the second instruction on the first portion of the wavefront prior to scheduling the plurality of execution units to execute the first instruction on the second portion of the wavefront (block 720). The processor can follow this same pattern of scheduling instructions for the remainder of the plurality of instructions, as long as the first indication is not detected. Also, the processor shares a portion of the GPRs between the first portion of the wavefront and the second portion of the wavefront if the first indication is not detected (block 725). After blocks 715 and 725, method 700 ends.
Turning now to
If the cache miss rate of the wavefront is less than a threshold (conditional block 810, “yes” leg), then the processor utilizes a first mode of operation when processing the wavefront (block 815). In one embodiment, the first mode of operation involves issuing each instruction on all portions of the wavefront before moving on to the next instruction in the shader program. If the cache miss rate of the wavefront is greater than or equal to the threshold (conditional block 810, “no” leg), then the processor utilizes a second mode of operation when processing the wavefront (block 820). In one embodiment, the second mode of operation involves executing a set of instructions on a first portion of the wavefront, then executing the same set of instructions on a second portion of the wavefront, and so on, until all portions of the wavefront have been processed. After blocks 815 and 820, method 800 ends.
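The decision at conditional block 810 reduces to a single comparison. The sketch below assumes the miss rate is measured as misses divided by accesses and that the threshold is expressed as a fraction; the `Mode` enum and the `select_mode` function are hypothetical names used only for illustration.

```python
from enum import Enum


class Mode(Enum):
    FIRST = 1   # each instruction on all portions before the next instruction
    SECOND = 2  # a set of instructions on one portion before the next portion


def select_mode(misses: int, accesses: int, threshold: float) -> Mode:
    """Pick the operating mode from a measured or predicted cache miss rate."""
    # Assumption: with no accesses recorded yet, treat the miss rate as zero.
    miss_rate = misses / accesses if accesses else 0.0
    # Below the threshold -> first mode; at or above it -> second mode.
    return Mode.FIRST if miss_rate < threshold else Mode.SECOND


print(select_mode(5, 100, threshold=0.2))   # Mode.FIRST
print(select_mode(20, 100, threshold=0.2))  # Mode.SECOND
```

A miss rate exactly equal to the threshold selects the second mode, matching the "greater than or equal to" condition on the "no" leg.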
Referring now to
If the control unit selects a first operating mode (conditional block 910, “first” leg), then the processor does not share registers between different subsets of the wavefront being processed by the processor (block 915). Otherwise, if the control unit selects a second operating mode (conditional block 910, “second” leg), then the processor shares one or more registers between different subsets of the wavefront being processed by the processor (block 920). For example, in one embodiment, sharing registers involves the processor using a shared portion of a register file for a first portion of a wavefront for a first set of instructions. Then, the processor reuses the shared portion of the register file for a second portion of the wavefront. If the wavefront has more than two portions, then the processor reuses the shared portion of the register file for the additional portions of the wavefront. After blocks 915 and 920, method 900 ends.
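One way to picture the reuse of the shared portion of the register file is with a simple accounting model. The sketch below assumes each portion of the wavefront needs some private registers plus some temporaries that are dead once the set of instructions finishes on that portion; the split into `private` and `shared` counts, the layout, and both function names are assumptions of this sketch rather than details given above.

```python
def vgprs_required(private, shared, num_portions, share):
    """Total registers occupied by one wavefront under this sketch's accounting."""
    if share:
        # Second mode: every portion keeps its private registers, but all
        # portions take turns using one shared block.
        return num_portions * private + shared
    # First mode: each portion holds its own copy of every register.
    return num_portions * (private + shared)


def physical_register(portion, logical, private, num_portions):
    """Map a portion's logical register to a physical index when sharing is on."""
    if logical < private:
        return portion * private + logical
    # Logical registers beyond the private range land in the shared block,
    # at the same physical index for every portion.
    return num_portions * private + (logical - private)


print(vgprs_required(private=8, shared=4, num_portions=2, share=False))  # 24
print(vgprs_required(private=8, shared=4, num_portions=2, share=True))   # 20
print(physical_register(0, 9, private=8, num_portions=2))                # 17
print(physical_register(1, 9, private=8, num_portions=2))                # 17
```

In this model the saving grows with the number of portions, since the shared block is allocated once regardless of how many portions reuse it.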
It is noted that in some embodiments, a processor can have more than two operating modes. In these embodiments, conditional block 910 can be applied such that a first subset (e.g., first mode, third mode, seventh mode) of operating modes follow the “first” leg and a second subset (e.g., second mode, fourth mode, fifth mode, sixth mode) of operating modes follow the “second” leg shown in
Referring now to
Referring to
Software application 1004 includes one or more sets of executable instructions 1006 as well as one or more shader programs 1008. The set of executable instructions 1006 represent one or more programs that have been compiled into machine language code suitable for execution at the processing unit 115. Each shader program 1008 (also commonly known as a “compute kernel” or simply a “shader”) is a program representing a task or workload intended to be executed at least partially by GPU 130, and typically with multiple instances of shader program 1008 being executed in parallel by two or more of compute units 145 (
OS 1002 includes an OS kernel 1010, one or more kernel-mode drivers 1012, one or more application programming interfaces (APIs) 1014, and one or more user-mode drivers 1016. OS kernel 1010 represents the functional core of OS 1002 and is responsible for boot initialization, memory allocation/deallocation, input/output control, and other fundamental hardware controls, as well as facilitating execution of software application 1004. Kernel-mode driver 1012 manages the general operation of the hardware of GPU 130, including initialization of GPU 130, setting display modes, managing mouse hardware, managing allocation/deallocation of physical memory for GPU 130, managing the command buffer (not shown) in the system memory 150 that facilitates tasking of commands from processing unit 115 to GPU 130, and the like.
User-mode driver 1016 operates as the interface to GPU 130 for one or more shader programs 1008 of software application 1004. However, to facilitate hardware abstraction, shader program 1008 typically is not implemented in software application 1004 as machine readable code (i.e., “native” code), but rather as source code (that is, in a human readable syntax), such as OpenGL (TM) Shading Language (GLSL) or High Level Shading Language (HLSL) syntax, or in partially compiled bytecode, such as the Standard Portable Intermediate Representation (SPIR) bytecode format, and which rely on one or more APIs 1014, such as an OpenCL (TM) API, an OpenGL (TM) API, a Direct3D (TM) API, a CUDA (TM) API, and the like, and their associated libraries. As shader program 1008 is not in native code format, user-mode driver 1016 employs a shader compiler 1018 that operates to perform run time compilation (also known as real time compilation or just-in-time (JIT) compilation) of the source code or bytecode representation of shader program 1008 to machine readable code executable by GPU 130. In other embodiments, an offline compiler is employed to compile the code representing shader program 1008 into executable native code. The compiled executable code representation of shader program 1008 is then provided by the user-mode driver 1016 to GPU 130 via a command buffer (not shown) implemented in the system memory 150 and managed by, for example, the scheduler unit 245 (
Referring now to
As explained above, the number of work-items in a wavefront may exceed the number of execution units in the SIMD cores of GPU 130, making scheduling of instructions of the wavefront difficult. Likewise, in some implementations, a wavefront may operate on a data structure, such as a vertex buffer or other array of structures, that is larger than the cache allocated to store the data of the data structure, which can result in significant cache thrashing, and thus inefficient execution performance, when GPU 130 attempts to schedule an entire wavefront for execution in parallel. Thus, as explained above and further explained below, GPU 130 employs at least two different modes of operation, one of which involves GPU 130 executing each instruction for an entire wavefront in parallel (that is, the “regular mode” or “first mode”), and another mode in which the GPU 130 executes a set of instructions on a portion of a wavefront in parallel, then turns to execution of this same set of instructions on another portion of the wavefront, and so forth (that is, the “subvector looping mode” or “second mode”). In some embodiments, GPU 130 includes a controller or other hardware-based mechanism to independently identify when it would be advantageous to switch from the regular mode of operation to the subvector looping mode of operation, and vice versa. However, in other embodiments, this mode selection process is implemented instead at compile time, thus providing software-implemented dynamic mode switching so as to better optimize execution of different sections of the shader program. Method 1100 of
As described below, in some embodiments, method 1100 relies on comparison of the VGPR usage of each section of shader program 1008 to a specified VGPR usage threshold to determine whether to implement the subvector looping mode to execute the corresponding section. Accordingly, as part of initialization, at block 1120, computing system 100 identifies the specified VGPR usage threshold to be employed. In one embodiment, the VGPR usage threshold has been previously identified and fixed, such as by setting the VGPR usage threshold as a constant in the code implementing the shader compiler 1018, by programming a fuse or one-time-programmable element, and the like. In other embodiments, the VGPR usage threshold is user-programmable or otherwise programmable, such as via programming of a guest-accessible register. In still other embodiments, the computing system 100 identifies or calculates the VGPR usage threshold dynamically for each iteration of method 1100. In any of these approaches, the VGPR usage threshold may be set based on the maximum number of VGPRs available to each execution unit. For example, assume that each execution unit has 12 dedicated VGPRs available. In this example, the VGPR usage threshold may be set equal to this number of dedicated VGPRs (that is, VGPR usage threshold=12 VGPRs), or set to some fixed percentage of this number, and the like.
At block 1122, dynamic subvector modification stage 1104 of shader compiler 1018 analyzes intermediate representation 1108 of shader program 1008 (or in some instances, shader program 1008 itself) to identify the expected VGPR usage for each section of shader program 1008 on a section-by-section basis. To this end, shader program 1008 (or intermediate representation 1108 thereof) can be logically segmented into a plurality of sections using any of a variety of criteria, or combination of criteria. For example, all of the instructions of a loop may be designated together as a single section, as may the instructions of a subroutine or function call. Similarly, for instructions not included in a loop or a subroutine, a section may be designated based on a specified number of instructions in sequence. In still other embodiments, a section may be identified as all instructions occurring between an instance where VGPR usage by shader program 1008 exceeds the VGPR usage threshold and the next instance where the VGPR usage then falls below the VGPR usage threshold (that is, sections may be defined relative to VGPR usage and the VGPR usage threshold). VGPR usage by shader program 1008 at any given point in the instruction sequence or for any given section may be determined using any of a variety of techniques. For example, in some embodiments, shader compiler 1018 determines VGPR usage using any of a variety of well-known or proprietary register “liveness” analysis techniques (also commonly referred to as “live variable analysis”).
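As one illustration of a liveness-based estimate, the sketch below runs a backward live-variable scan over a straight-line section and reports the peak number of simultaneously live registers. It is a textbook simplification, not the analysis used by shader compiler 1018: it ignores control flow, models each instruction as a hypothetical `(dest, sources)` tuple, and assumes nothing is live on exit unless passed in `live_out`.

```python
def max_live_registers(section, live_out=()):
    """Peak count of simultaneously live registers in a straight-line section.

    `section` is a list of (dest, sources) tuples in program order.
    """
    live = set(live_out)
    peak = len(live)
    for dest, sources in reversed(section):
        # The destination needs a register at this instruction even if unused later.
        peak = max(peak, len(live | {dest}))
        live.discard(dest)    # dest is defined here, so it is not live before
        live.update(sources)  # sources must be live on entry to the instruction
        peak = max(peak, len(live))
    return peak


section = [
    ("v0", []),            # v0 = load
    ("v1", []),            # v1 = load
    ("v2", []),            # v2 = load
    ("v3", ["v0", "v1"]),  # v3 = v0 + v1
    ("v4", ["v3", "v2"]),  # v4 = v3 * v2
]
print(max_live_registers(section))  # 3
```

The peak value for a section is what would then be compared against the VGPR usage threshold.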
At block 1124, shader compiler 1018 compares the VGPR usage for a selected section of the shader program 1008 with the VGPR usage threshold determined at block 1120. If the VGPR usage for the selected section exceeds the VGPR usage threshold, then at block 1126 shader compiler 1018 modifies the shader program 1008 so that when GPU 130 executes the resulting modified shader program (e.g., modified representation 1110 of shader program 1008), the GPU 130 is configured to operate in the subvector looping mode for execution of the instructions of the selected section.
As described below in more detail with reference to
At block 1128 shader compiler 1018 determines whether there are any further sections to consider. If so, shader compiler 1018 selects the next section of shader program 1008 and repeats the process of blocks 1124 and 1126 for the selected section. If not, then at block 1130 shader compiler 1018 issues the resulting modified shader program (as, for example, modified representation 1110) from the dynamic subvector modification stage 1104 to back end stage 1106 for further processing and generation of native code shader 1112 from the modified shader program.
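The loop over sections formed by blocks 1124 through 1128 can be sketched as a simple filter. The threshold of 12 is the example value mentioned earlier for 12 dedicated VGPRs; the function name `select_sections` and the per-section usage numbers are hypothetical.

```python
VGPR_USAGE_THRESHOLD = 12  # example value: 12 dedicated VGPRs per execution unit


def select_sections(section_usage, threshold=VGPR_USAGE_THRESHOLD):
    """Return indices of sections whose expected VGPR usage exceeds the threshold."""
    selected = []
    for index, usage in enumerate(section_usage):
        # Only usage strictly above the threshold triggers the modification.
        if usage > threshold:
            selected.append(index)
    return selected


# Hypothetical expected VGPR usage for five sections of a shader program.
print(select_sections([8, 12, 13, 20, 5]))  # [2, 3]
```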
Referring now to
As explained above, the modification of shader program 1008 so that a section with VGPR usage above the threshold is executed by GPU 130 using the subvector looping mode may be implemented in any of a variety of ways. Referring to
In the example of
This example demonstrates the modification performed when a single section with VGPR usage in excess of the VGPR usage threshold is to be executed between two sections with VGPR usage that does not exceed the VGPR usage threshold. In instances where multiple sections are to be executed in a row while GPU 130 is in the subvector looping mode, rather than insert a subvector looping mode entry instruction 1308 and a subvector looping mode exit instruction 1310 for each such section separately, the shader compiler 1018 instead may insert a single subvector looping mode entry instruction 1308 before the first instruction of the first section in this sequence and insert a single subvector looping mode exit instruction 1310 after the last instruction of the last section in this sequence. For example, as sections 3-5 each exceed the VGPR usage threshold 1202, in generating the modified shader program shader compiler 1018 may insert a single subvector looping mode entry instruction 1308 before the first instruction of section 3 and insert a single subvector looping mode exit instruction 1310 after the last instruction of section 5. Moreover, although
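The coalescing of adjacent over-threshold sections into a single entry/exit pair can be sketched as a pass that tracks whether the previous section was already in the subvector looping mode. The marker strings `SUBVECTOR_MODE_ENTER` and `SUBVECTOR_MODE_EXIT` are hypothetical stand-ins for entry instruction 1308 and exit instruction 1310, and the usage numbers are invented so that sections 3-5 exceed a threshold of 12.

```python
ENTER = "SUBVECTOR_MODE_ENTER"  # hypothetical stand-in for entry instruction 1308
EXIT = "SUBVECTOR_MODE_EXIT"    # hypothetical stand-in for exit instruction 1310


def insert_mode_markers(sections, usage, threshold):
    """Flatten sections into one stream, wrapping each over-threshold run once."""
    out = []
    in_subvector = False
    for instructions, section_usage in zip(sections, usage):
        over = section_usage > threshold
        if over and not in_subvector:
            out.append(ENTER)  # first section of a run
        elif not over and in_subvector:
            out.append(EXIT)   # the previous run has ended
        in_subvector = over
        out.extend(instructions)
    if in_subvector:
        out.append(EXIT)       # the run extends to the end of the program
    return out


sections = [["s1"], ["s2"], ["s3"], ["s4"], ["s5"], ["s6"]]
usage = [4, 6, 14, 16, 13, 5]
print(insert_mode_markers(sections, usage, threshold=12))
# ['s1', 's2', 'SUBVECTOR_MODE_ENTER', 's3', 's4', 's5', 'SUBVECTOR_MODE_EXIT', 's6']
```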
Referring now to
Accordingly, in this example the shader compiler 1018 modifies the shader program 1008 to provide for execution of the set 1404 of instructions on only one portion of a wavefront at a time by replicating the set 1404 in the modified shader program for each portion of the wavefront, as well as inserting instructions to limit execution of each replicated set 1404 of instructions to a corresponding portion of the wavefront. To illustrate, as shown by block 1412 representing the resulting portion of the modified shader program, the set 1404 of instructions for section 7 is replaced by a replacement set 1414 of instructions, which includes two replicas of the set 1404 of instructions, denoted set 1404_L and set 1404_H in the resulting modified shader program. Replicated set 1404_L is inserted for the lower half of the wavefront and replicated set 1404_H is inserted for the higher half of the wavefront. As illustrated, replacement set 1414 of instructions further includes instructions that limit execution of the instructions of replicated set 1404_L to the first half of the wavefront and limit execution of the instructions of replicated set 1404_H to the second half of the wavefront.
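The replication approach can be sketched as a small source-to-source transform. The three guard mnemonics below are hypothetical placeholders, not actual ISA instructions, and the final restore of the execution mask is an assumption of this sketch that the description above does not spell out.

```python
LIMIT_LOW = "LIMIT_EXEC_TO_LOW_HALF"    # hypothetical guard for the low-half replica
LIMIT_HIGH = "LIMIT_EXEC_TO_HIGH_HALF"  # hypothetical guard for the high-half replica
RESTORE = "RESTORE_EXEC"                # assumed: re-enable both halves afterwards


def replicate_for_halves(section):
    """Build a replacement set: one guarded copy of the section per wavefront half."""
    replacement = [LIMIT_LOW]
    replacement.extend(section)   # replica for the lower half of the wavefront
    replacement.append(LIMIT_HIGH)
    replacement.extend(section)   # replica for the higher half of the wavefront
    replacement.append(RESTORE)
    return replacement


print(replicate_for_halves(["v_mul", "v_add"]))
# ['LIMIT_EXEC_TO_LOW_HALF', 'v_mul', 'v_add',
#  'LIMIT_EXEC_TO_HIGH_HALF', 'v_mul', 'v_add', 'RESTORE_EXEC']
```

The trade-off of this approach is code size: the section appears once per portion, so a wavefront split into more portions would need correspondingly more replicas.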
Referring now to
In this example the shader compiler 1018 modifies the shader program 1008 to provide for execution of the set 1504 of instructions on only one portion of a wavefront at a time by including instructions that effectively place the set 1504 of instructions in a loop in the modified shader program, and which cause each iteration of the loop to be executed for a different portion of the same wavefront in sequence. As illustrated by block 1512, which represents the resulting portion of the modified shader program, the set 1504 of instructions for section 7 is replaced by a replacement set 1514 of instructions, which includes instructions 1516 that place the set 1504 of instructions in a loop and which provide that each iteration of the loop is performed on a separate portion of the wavefront. As shown, these instructions 1516 include the function calls “S_SUBVECTOR_LOOP_BEGIN” and “S_SUBVECTOR_LOOP_END”, which represent the following functions:
As such, if EXEC_HI=0, the set 1504 of instructions is executed only once: EXEC_LO is stored in S0 and restored at the end, but it is zero anyway. If EXEC_LO was zero at the start, the result is the same. If both halves of the EXEC mask are non-zero, the low pass is performed first (storing EXEC_HI in S0), then EXEC_HI is restored, EXEC_LO is saved off, and the pass is performed again. EXEC_LO is then restored at the end of the second pass. The “pass #” is encoded by observing which half of EXEC is zero.
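The pass behavior described above can be modeled with a short simulation. This sketch is reconstructed from the prose only and is not the definition of the S_SUBVECTOR_LOOP_BEGIN and S_SUBVECTOR_LOOP_END functions. The name `run_subvector_loop`, the 64-bit EXEC mask split into two 32-bit halves, the choice of which half is saved in S0 in the single-pass cases, and the skipping of the loop when EXEC is entirely zero are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: each half of EXEC covers 32 lanes


def run_subvector_loop(exec_mask, body):
    """Call body(active_mask) once per non-empty half of exec_mask and
    return the EXEC mask as it stands after the loop."""
    exec_lo = exec_mask & MASK32
    exec_hi = (exec_mask >> 32) & MASK32

    # Loop entry (assumed behavior).
    if exec_lo == 0 and exec_hi == 0:
        return exec_mask            # assumption: no active lanes, skip the body
    if exec_lo == 0:
        s0 = exec_lo                # high pass only; the saved value is zero
    else:
        s0 = exec_hi                # low pass first; park the high half in S0
        exec_hi = 0

    while True:
        body(exec_lo | (exec_hi << 32))

        # Loop exit (assumed behavior): the zero half identifies the pass.
        if exec_hi != 0:
            exec_lo = s0            # high pass finished: restore the low half
            break
        if s0 == 0:
            break                   # low pass finished and no high lanes exist
        # Switch to the second pass: restore high half, save off low half.
        exec_hi, s0, exec_lo = s0, exec_lo, 0

    return exec_lo | (exec_hi << 32)


if __name__ == "__main__":
    seen = []
    final = run_subvector_loop((0x5 << 32) | 0x3, seen.append)
    print([hex(m) for m in seen], hex(final))
```

With both halves active, the body runs first with only the low lanes enabled and then with only the high lanes enabled, and the original mask is intact afterwards. With one half empty, the body runs exactly once.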
In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. Such non-transitory computer readable storage media can include, for example, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computing system via a wired or wireless network (e.g., network accessible storage (NAS)). The executable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.
The present application is a Continuation-In-Part of U.S. patent application Ser. No. 15/439,540, filed on Feb. 22, 2017 and entitled “Variable Wavefront Size”, the entirety of which is incorporated by reference herein.
Relation | Number | Date | Country
---|---|---|---
Parent | 15439540 | Feb 2017 | US
Child | 16425625 | | US