Graphics processing units (GPUs) implement graphics processing pipelines that concurrently process copies of commands that are retrieved from a command buffer. The graphics pipeline includes one or more shaders that execute using resources of the graphics pipeline and one or more fixed function hardware blocks. The graphics pipeline is typically divided into a geometry portion that performs geometry operations on patches or other primitives such as triangles that are formed of vertices and edges and represent portions of an image. The shaders in the geometry portion can include vertex shaders, hull shaders, domain shaders, and geometry shaders. The geometry portion of the graphics pipeline completes when the primitives produced by the geometry portion of the pipeline are rasterized (e.g., by one or more scan converters) to form sets of pixels that represent portions of the image. Subsequent processing on the pixels is referred to as pixel processing and includes operations performed by shaders such as a pixel shader executing using resources of the graphics pipeline. GPUs and other multithreaded processing units typically implement multiple processing elements (which are also referred to as processor cores or compute units) that concurrently execute multiple instances of a single program on multiple data sets as a single wave. A hierarchical execution model is used to match the hierarchy implemented in hardware. The execution model defines a kernel of instructions that are executed by all the waves (also referred to as wavefronts, threads, streams, or work items).
The present disclosure is better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
Shaders, such as a geometry shader in the geometry portion of the graphics pipeline of a GPU, launch waves that are processed by the shader. The results of the shader processing are passed to downstream entities such as other shaders in the pipeline. For example, a geometry shader wave generator launches waves using a greedy algorithm that attempts to use as many of the resources of the graphics pipeline as possible. The primitives processed by the geometry shader are passed to one or more scan converters that convert the primitives into pixels for processing in the pixel shader. Launching waves based on a greedy algorithm for processing in one shader can deprive downstream shaders of the resources needed to complete their operations on primitives or pixels. For example, the pixel shader may not be able to access resources of the graphics pipeline to perform shading on pixels received from the scan converter if the geometry shader wave generator launches too many waves and the geometry shader monopolizes the resources of the graphics pipeline. Some graphics pipelines are configured to limit the number of waves in-flight by constraining the number of compute units that can be allocated to the shaders for processing waves. However, a static limit on the number of available compute units typically reduces performance of the graphics pipeline when executing draw calls that require larger numbers of compute units.
The primitive hub provides feedback that indicates the fullness to a shader processor input (SPI), which selectively throttles the geometry shader waves based on resource usage of the geometry shader and the pixel shader. Some embodiments of the SPI determine the relative allocation of local data store (LDS) resources to in-flight geometry shader waves and in-flight pixel shader waves, the relative allocation of registers such as vector general-purpose registers (VGPRs) to the in-flight geometry shader waves and in-flight pixel shader waves, or a combination thereof. The SPI increments the stall counter in response to the relative allocation of resources to the in-flight geometry shader waves and in-flight pixel shader waves exceeding a threshold that indicates that the in-flight geometry shader waves are consuming resources that prevent the in-flight pixel shader waves from being processed. In some embodiments, the value of the stall counter is also determined based on lifetimes of geometry shader waves in one or more geometry shader groups so that the stall counter is incremented if the lifetimes exceed a threshold.
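The stall-counter behavior described above can be sketched in pseudocode. This is an illustrative model only: the function and parameter names (`lds_geometry`, `allocation_threshold`, the increment of 8, and so on) are assumptions for the sketch, not values taken from the disclosure.

```python
def update_stall_counter(stall_counter, lds_geometry, lds_pixel,
                         vgpr_geometry, vgpr_pixel,
                         wave_lifetimes, lifetime_threshold,
                         allocation_threshold=0.75, increment=8):
    """Increment the stall counter when in-flight geometry shader waves
    consume a share of LDS or VGPR resources that prevents in-flight
    pixel shader waves from being processed, or when geometry shader
    wave lifetimes exceed a threshold."""
    total_lds = lds_geometry + lds_pixel
    total_vgpr = vgpr_geometry + vgpr_pixel
    lds_share = lds_geometry / total_lds if total_lds else 0.0
    vgpr_share = vgpr_geometry / total_vgpr if total_vgpr else 0.0
    # Relative allocation of LDS/VGPR resources to geometry shader waves.
    if max(lds_share, vgpr_share) > allocation_threshold:
        stall_counter += increment
    # Lifetimes of geometry shader waves in the monitored groups.
    if any(t > lifetime_threshold for t in wave_lifetimes):
        stall_counter += increment
    return stall_counter
```

The two conditions are checked independently, so a group that both monopolizes resources and runs long produces a larger stall than either condition alone.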
The techniques described herein are, in different embodiments, employed at any of a variety of parallel processors (e.g., vector processors, graphics processing units (GPUs), general-purpose GPUs (GPGPUs), non-scalar processors, highly-parallel processors, artificial intelligence (AI) processors, inference engines, machine learning processors, other multithreaded processing units, and the like).
The processing system 100 also includes a central processing unit (CPU) 130 that is connected to the bus 110 and therefore communicates with the GPU 115 and the memory 105 via the bus 110. The CPU 130 implements a plurality of processor cores 131, 132, 133 (collectively referred to herein as “the processor cores 131-133”) that execute instructions concurrently or in parallel. The number of processor cores 131-133 implemented in the CPU 130 is a matter of design choice and some embodiments include more or fewer processor cores than illustrated in
An input/output (I/O) engine 145 handles input or output operations associated with the display 120, as well as other elements of the processing system 100 such as keyboards, mice, printers, external disks, and the like. The I/O engine 145 is coupled to the bus 110 so that the I/O engine 145 communicates with the memory 105, the GPU 115, or the CPU 130. In the illustrated embodiment, the I/O engine 145 reads information stored on an external storage component 150, which is implemented using a non-transitory computer readable medium such as a compact disk (CD), a digital video disc (DVD), and the like. The I/O engine 145 is also able to write information to the external storage component 150, such as the results of processing by the GPU 115 or the CPU 130.
The processing system 100 implements pipeline circuitry for executing instructions in multiple stages of the pipeline. The pipeline circuitry is implemented in some embodiments of the compute units 121-123 or the processor cores 131-133. In some embodiments, the pipeline circuitry is used to implement a graphics pipeline that executes shaders of different types including, but not limited to, the vertex shaders, hull shaders, domain shaders, geometry shaders, and pixel shaders. The pipeline circuitry also includes buffers that hold primitives generated by the shaders. In some embodiments, one or more buffers hold primitives generated by the geometry shader and then provide these primitives to a pixel shader. The pipeline circuitry also includes a primitive hub that monitors fullness of the buffers. Launching of waves from the geometry shader is throttled based on the fullness of the buffers. A shader processor input (SPI) selectively throttles the waves launched by the geometry shader based on a signal from the primitive hub indicating the fullness, an indication of relative resource usage of geometry waves and pixel waves in the graphics pipeline, or an indication of lifetimes of the geometry waves.
The graphics pipeline 200 has access to storage resources 205 such as a hierarchy of one or more memories or caches that are used to implement buffers and store vertex data, texture data, and the like. In the illustrated embodiment, the storage resources 205 include local data store (LDS) 206 circuitry that is used to store data and vector general-purpose registers (VGPRs) to store register values used during rendering by the graphics pipeline 200. The storage resources 205 are implemented using some embodiments of the system memory 105 shown in
An input assembler 210 accesses information from the storage resources 205 that is used to define objects that represent portions of a model of a scene. An example of a primitive is shown in
A vertex shader 215, which is implemented in software in the illustrated embodiment, logically receives a single vertex 212 of a primitive as input and outputs a single vertex. Some embodiments of shaders such as the vertex shader 215 implement massive single-instruction-multiple-data (SIMD) processing so that multiple vertices are processed concurrently. The graphics pipeline 200 implements a unified shader model so that all the shaders included in the graphics pipeline 200 have the same execution platform on the shared massive SIMD compute units. The shaders, including the vertex shader 215, are therefore implemented using a common set of resources that is referred to herein as the unified shader pool 216.
A hull shader 218 operates on input high-order patches or control points that are used to define the input patches. The hull shader 218 outputs tessellation factors and other patch data. In some embodiments, primitives generated by the hull shader 218 are provided to a tessellator 220. The tessellator 220 receives objects (such as patches) from the hull shader 218 and generates information identifying primitives corresponding to the input object, e.g., by tessellating the input objects based on tessellation factors provided to the tessellator 220 by the hull shader 218. Tessellation subdivides input higher-order primitives such as patches into a set of lower-order output primitives that represent finer levels of detail, e.g., as indicated by tessellation factors that specify the granularity of the primitives produced by the tessellation process. A model of a scene is therefore represented by a smaller number of higher-order primitives (to save memory or bandwidth) and additional details are added by tessellating the higher-order primitive.
A domain shader 224 inputs a domain location and (optionally) other patch data. The domain shader 224 operates on the provided information and generates a single vertex for output based on the input domain location and other information. In the illustrated embodiment, the domain shader 224 generates primitives 222 based on the triangles 211 and the tessellation factors. A geometry shader 226 receives an input primitive and outputs up to four primitives that are generated by the geometry shader 226 based on the input primitive. In the illustrated embodiment, the geometry shader 226 generates the output primitives 228 based on the tessellated primitive 222.
One stream of primitives is provided to one or more scan converters 230 and, in some embodiments, up to four streams of primitives are concatenated to buffers in the storage resources 205. The scan converters 230 perform shading operations and other operations such as clipping, perspective division, scissoring, viewport selection, and the like. The scan converters 230 generate a set 232 of pixels that are subsequently processed in the pixel processing portion 202 of the graphics pipeline 200.
In the illustrated embodiment, a pixel shader 234 inputs a pixel flow (e.g., including the set 232 of pixels) and outputs zero or one pixel flow in response to the input pixel flow. An output merger block 236 performs blend, depth, stencil, or other operations on pixels received from the pixel shader 234.
Some or all the shaders in the graphics pipeline 200 perform texture mapping using texture data that is stored in the storage resources 205. For example, the pixel shader 234 can read texture data from the storage resources 205 and use the texture data to shade one or more pixels. The shaded pixels are then provided to a display for presentation to a user.
A primitive hub 325 receives primitives from the PA 321-323 and distributes the primitives to scan converters 331, 332, 333, which are collectively referred to herein as “the scan converters 331-333.” Some embodiments of the primitive hub 325 include a buffer complex (not shown in
Some embodiments of the SPI 301-303 collect data that indicates resource usage by shaders including a geometry shader (such as the geometry shader 226 shown in
The data collection logic has two usage modes that are controlled by a parameter value accessible and modifiable by the SPI:
The portion 300 of the graphics pipeline can hide some latency in groups of waves launched from the geometry shader. However, if the actual lifetime of the waves (or the corresponding groups) exceeds the latency that can be hidden, the performance of the graphics pipeline declines and the geometry shader wave groups begin blocking resources for longer durations. Thus, if pixel shader waves are starved for resources, a geometry shader group with a longer lifetime potentially creates a longer stall for pixels than a geometry shader group with a shorter lifetime. Some embodiments of the SPI 301-303 therefore monitor the lifetimes of geometry shader groups and compare the lifetimes to a threshold. The SPI 301-303 generate longer stalls to throttle wave launches from the geometry shader in response to the lifetimes of the geometry shader groups exceeding the threshold.
A primitive hub 405 includes sets 411, 412, 413 of buffers 415 (only one indicated by a reference numeral in the interest of clarity) and each of the sets 411-413 is associated with a corresponding scan converter 421, 422, 423, which are collectively referred to herein as “the scan converters 421-423.” The primitive hub 405 receives the primitives from the PA 401-403 and stores copies of the primitives in corresponding buffers in each of the sets 411-413. The primitive hub 405 also monitors fullness of the buffers 415 and determines whether to throttle wave launch based on the fullness. In some embodiments, polling logic 425 in the primitive hub 405 polls the buffers 415 in the sets 411-413 to determine their fullness at programmed time intervals such as every thousand clock cycles. A rate limiter 430 in the primitive hub 405 increments a number of dead cycles that is used to throttle wave launch for the geometry shader. In some embodiments, the rate limiter 430 uses a first value that indicates the number of dead cycles to be added on a transition into throttling and a second value that indicates the step size of each increment or decrement. Thus, on every increment pulse, the number of dead cycles is increased by the second value and, on every decrement pulse, the number of dead cycles is reduced by the second value.
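The rate-limiter behavior can be modeled as follows. The class and parameter names, the step size, the 1024-cycle ceiling, and the 50% fullness pivot are illustrative assumptions for the sketch, not values specified by the disclosure.

```python
class RateLimiter:
    """Model of a rate limiter that steps a dead-cycle count up or down
    by a configured step size on each polling sample."""

    def __init__(self, step=4, max_dead_cycles=1024):
        self.step = step                        # increment/decrement step size
        self.max_dead_cycles = max_dead_cycles  # clamp for the dead-cycle count
        self.dead_cycles = 0

    def sample(self, fullness):
        """fullness: fraction in [0, 1] reported by the polling logic
        at each programmed time interval."""
        if fullness > 0.5:
            # Buffers are filling: increment pulse adds dead cycles.
            self.dead_cycles = min(self.dead_cycles + self.step,
                                   self.max_dead_cycles)
        else:
            # Buffers are draining: decrement pulse removes dead cycles.
            self.dead_cycles = max(self.dead_cycles - self.step, 0)
        return self.dead_cycles
```

Because the count is clamped at zero and at a maximum, sustained drain eventually removes all throttling and sustained backpressure saturates rather than growing without bound.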
The portion 400 of the graphics pipeline includes counters 435 that indicate how many dead cycles are used to selectively throttle wave launch. Some embodiments of the counters 435 are implemented in the corresponding SPI such as the SPI 301-303 shown in
At block 505, a primitive hub monitors buffer fullness for a set of FIFO buffers that receive data from one or more primitive assemblers and provide the data to one or more scan converters for rasterization. At block 510, the primitive hub generates a status signal based on the buffer fullness. As discussed herein, the status signal can include a set of bits (e.g., two bits) that have values indicating different ranges of buffer fullness.
At block 515, the primitive hub provides the status signal to one or more SPIs. At block 520, a counter value is determined based on the status signal. For example, the counter value can be given a value that is determined based on the range of buffer fullness indicated by the status signal so that the counter value is incremented by a larger amount if the buffer fullness is larger. As discussed herein, selective throttling of geometry waves is performed using the counter value determined based on the buffer fullness in conjunction with counter values determined based on relative resource usage of geometry shader waves and pixel shader waves and counter values that are determined based on lifetimes of the geometry shader waves or groups thereof.
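The mapping from buffer fullness to a status signal and then to a counter increment can be sketched as below. The fullness ranges and increment values are illustrative assumptions; the disclosure specifies only that the status can be a set of bits (e.g., two bits) and that larger fullness yields a larger increment.

```python
def fullness_status(fullness):
    """Encode buffer fullness (0.0-1.0) as a two-bit status value,
    one value per fullness range."""
    if fullness < 0.25:
        return 0b00
    if fullness < 0.50:
        return 0b01
    if fullness < 0.75:
        return 0b10
    return 0b11

def counter_increment(status, increments=(0, 4, 16, 64)):
    """Larger fullness ranges map to larger counter increments."""
    return increments[status]
```

A non-linear increment table (as assumed here) lets the SPI respond gently to partially full buffers and aggressively when the buffers approach capacity.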
At block 605, an SPI monitors resource usage by geometry shader waves and pixel shader waves. In the illustrated embodiment, the SPI monitors LDS usage, VGPR usage, or a combination thereof by the geometry shader waves and the pixel shader waves. At block 610, the SPI determines a relative resource allocation to the geometry and pixel shader waves based on the LDS usage, the VGPR usage, or a combination thereof, as discussed herein.
At decision block 615, the SPI determines whether the relative allocation is above a threshold. If so, the method 600 flows to the block 620 and the SPI increments the value that is used to set the counter for selectively throttling launch of geometry shader waves. If the relative allocation is not above the threshold, the method 600 flows to the block 625 and the SPI maintains the counter at its current value. As discussed herein, selective throttling of geometry waves is performed using the counter value determined based on the relative resource usage of the geometry shader waves and the pixel shader waves in conjunction with counter values determined based on buffer fullness at a primitive hub and counter values that are determined based on lifetimes of the geometry shader waves or groups thereof.
At block 705, a geometry shader wave (or a group of geometry shader waves) is launched in a graphics pipeline. At block 710, an SPI determines a lifetime of the geometry shader wave (or the group), as discussed herein. At decision block 715, the SPI determines whether the lifetime is above a threshold. If so, the method 700 flows to the block 720 and the SPI increments the value that is used to set the counter for selectively throttling launch of geometry shader waves. If the lifetime is not above the threshold, the method 700 flows to the block 725 and the SPI maintains the counter at its current value. As discussed herein, selective throttling of geometry waves is performed using the counter value determined based on the lifetime of the geometry shader wave (or group) in conjunction with counter values determined based on buffer fullness at a primitive hub and counter values that are determined based on relative resource usage of the geometry shader waves and the pixel shader waves.
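The lifetime check of blocks 715-725 reduces to a simple comparison. The cycle-count representation of a lifetime and the increment value are assumptions made for this sketch.

```python
def update_counter_for_lifetime(counter, launch_cycle, retire_cycle,
                                lifetime_threshold, increment=8):
    """Blocks 715-725: increment the throttle counter if the measured
    wave (or group) lifetime exceeds the threshold; otherwise maintain
    the counter at its current value."""
    lifetime = retire_cycle - launch_cycle
    if lifetime > lifetime_threshold:
        return counter + increment   # block 720: increment
    return counter                   # block 725: maintain
```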
In some embodiments, the geometry shader waves are throttled by asserting a stall signal that has a predetermined value (e.g., a high value or 1) that is maintained until a stall count reaches another predetermined value such as zero. While the stall signal remains high, resources are not granted or allocated to geometry shader waves. The stall count is determined based on the FIFO status data generated by the primitive hub, the resource usage data generated by the SPI, and the lifetimes of the geometry waves. For example, the stall count can be generated by applying an OR operation to select the largest stall count among the three options disclosed above. A minimum stall count of zero and a maximum stall count of 1024 are used in some embodiments.
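Combining the three stall-count sources and deriving the stall signal can be sketched as below. A `max()` is used here to model the selection of the largest of the three counts; the bounds follow the minimum of zero and maximum of 1024 mentioned above.

```python
def combined_stall_count(fifo_count, usage_count, lifetime_count,
                         min_count=0, max_count=1024):
    """Select the largest of the three stall counts (FIFO status,
    resource usage, and wave lifetime) and clamp to [min, max]."""
    selected = max(fifo_count, usage_count, lifetime_count)
    return max(min_count, min(selected, max_count))

def stall_signal(stall_count):
    """The stall signal stays high (1) until the count reaches zero;
    while high, no resources are granted to geometry shader waves."""
    return 1 if stall_count > 0 else 0
```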
Throttling by the primitive hub (or based on the backpressure generated by the primitive hub) is performed based on the values of register fields that control the number of dead cycles indicated by the stall counter. A first field indicates a number of dead cycles added on a transition from a “no throttle” condition to a throttling condition. The second field indicates an increment or decrement to the dead cycles on each sample. If throttling is enabled and the next geometry shader wave has been granted resources, the stall counter is loaded with the stall count and begins counting down. The number of dead cycles to be added is determined on each sample, but the count is applied in response to the next geometry shader wave being granted resources.
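The load-on-grant, count-down behavior of the stall counter can be modeled as follows. The class and field names are assumptions; the two fields correspond to the first and second register fields described above.

```python
class StallCounter:
    """Model of the down-counting stall counter: loaded when throttling
    is enabled and the next geometry shader wave is granted resources,
    then counting down one dead cycle per clock."""

    def __init__(self, transition_cycles, step_cycles):
        self.transition_cycles = transition_cycles  # first field
        self.step_cycles = step_cycles              # second field
        self.count = 0

    def on_wave_grant(self, throttling_enabled):
        """Load the counter in response to a wave being granted resources."""
        if throttling_enabled and self.count == 0:
            self.count = self.transition_cycles

    def tick(self):
        """One clock cycle: count down toward zero."""
        if self.count > 0:
            self.count -= 1
        return self.count
```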
Throttling by the SPI is determined based on the resource usage information, as discussed herein. In some embodiments, there are multiple triggers for throttling geometry shader wave launches.
The first trigger is based on the LDS usage by geometry shader waves. The geometry shader wave launch is throttled in response to the measured usage exceeding a threshold. Some embodiments of the trigger generation logic use the following modes:
The second trigger is based on the VGPR usage by geometry shader waves. The geometry shader wave launch is throttled in response to the measured usage exceeding a threshold. Some embodiments of the trigger generation logic use the following modes:
The third trigger is set based on an average number of cycles during which a pixel shader wave is stalled. The number of cycles of pixel shader wave stall is sampled at predetermined time intervals, e.g., after predetermined numbers of clock cycles. The third trigger is set if the following conditions are met:
A computer readable storage medium may include any non-transitory storage medium, or combination of non-transitory storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media can include, but is not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer readable storage medium can include, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM) or other non-volatile memory device or devices, and the like. The executable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.
Number | Date | Country
---|---|---
20220188963 A1 | Jun 2022 | US