This disclosure relates to computer graphics.
A device that provides content for visual presentation generally includes a graphics processing unit (GPU). The GPU renders pixels that are representative of the content on a display. The GPU generates one or more pixel values for each pixel on the display to render each pixel for presentation.
In some instances, a GPU may implement a unified shader architecture for rendering graphics. In such instances, the GPU may configure a plurality of similar computing units to execute a pipeline of different shading operations. The computing units may be referred to as unified shading units or unified shader processors.
The techniques of this disclosure generally relate to performing shading operations associated with shader stages of a graphics rendering pipeline. For example, a graphics processing unit (GPU) may invoke one or more shading units to perform shading operations associated with a shader stage of the graphics rendering pipeline. According to aspects of this disclosure, the GPU may then perform shading operations associated with a second, different shader stage of the graphics rendering pipeline with the shading units that are designated for performing the first shading operations. For example, the GPU may perform shading operations associated with the second stage while adhering to an input/output interface associated with the first shader stage. In this way, the GPU may emulate a GPU having greater shading resources by performing multiple shading operations with the same shading units.
In an example, aspects of this disclosure relate to a method of rendering graphics that includes performing, with a hardware shading unit of a graphics processing unit designated for vertex shading, vertex shading operations to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit is configured to receive a single vertex as an input and generate a single vertex as an output, and performing, with the hardware shading unit of the graphics processing unit, a geometry shading operation to generate one or more new vertices based on one or more of the vertex shaded vertices, wherein the geometry shading operation operates on at least one of the one or more vertex shaded vertices to output the one or more new vertices.
In another example, aspects of this disclosure relate to a graphics processing unit for rendering graphics that includes one or more processors configured to perform, with a hardware shading unit of the graphics processing unit designated for vertex shading, vertex shading operations to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit is configured to receive a single vertex as an input and generate a single vertex as an output, and perform, with the hardware shading unit of the graphics processing unit, a geometry shading operation to generate one or more new vertices based on one or more of the vertex shaded vertices, wherein the geometry shading operation operates on at least one of the one or more vertex shaded vertices to output the one or more new vertices.
In another example, aspects of this disclosure relate to an apparatus for rendering graphics that includes means for performing, with a hardware shading unit of a graphics processing unit designated for vertex shading, vertex shading operations to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit is configured to receive a single vertex as an input and generate a single vertex as an output, and means for performing, with the hardware shading unit of the graphics processing unit, a geometry shading operation to generate one or more new vertices based on one or more of the vertex shaded vertices, wherein the geometry shading operation operates on at least one of the one or more vertex shaded vertices to output the one or more new vertices.
In another example, aspects of this disclosure relate to a non-transitory computer-readable medium having instructions stored thereon that, when executed, cause one or more processors to, with a hardware shading unit designated for vertex shading, perform vertex shading operations to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit is configured to receive a single vertex as an input and generate a single vertex as an output, and with the hardware shading unit that is designated for vertex shading, perform a geometry shading operation to generate one or more new vertices based on one or more of the vertex shaded vertices, wherein the geometry shading operation operates on at least one of the one or more vertex shaded vertices to output the one or more new vertices.
In another example, aspects of this disclosure relate to a method for rendering graphics that includes performing, with a hardware unit of a graphics processing unit designated for vertex shading, a vertex shading operation to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit adheres to an interface that receives a single vertex as an input and generates a single vertex as an output, and performing, with the hardware unit of the graphics processing unit designated for vertex shading, a hull shading operation to generate one or more control points based on one or more of the vertex shaded vertices, wherein the one or more hull shading operations operate on at least one of the one or more vertex shaded vertices to output the one or more control points.
In another example, aspects of this disclosure relate to a graphics processing unit for rendering graphics that includes one or more processors configured to perform, with a hardware unit of the graphics processing unit designated for vertex shading, a vertex shading operation to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit adheres to an interface that receives a single vertex as an input and generates a single vertex as an output, and perform, with the hardware unit of the graphics processing unit designated for vertex shading, a hull shading operation to generate one or more control points based on one or more of the vertex shaded vertices, wherein the one or more hull shading operations operate on at least one of the one or more vertex shaded vertices to output the one or more control points.
In another example, aspects of this disclosure relate to an apparatus for rendering graphics that includes means for performing, with a hardware unit of a graphics processing unit designated for vertex shading, a vertex shading operation to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit adheres to an interface that receives a single vertex as an input and generates a single vertex as an output, and means for performing, with the hardware unit of the graphics processing unit designated for vertex shading, a hull shading operation to generate one or more control points based on one or more of the vertex shaded vertices, wherein the one or more hull shading operations operate on at least one of the one or more vertex shaded vertices to output the one or more control points.
In another example, aspects of this disclosure relate to a non-transitory computer-readable medium having instructions stored thereon that, when executed, cause one or more processors to perform, with a hardware unit of a graphics processing unit designated for vertex shading, a vertex shading operation to shade input vertices so as to output vertex shaded vertices, wherein the hardware unit adheres to an interface that receives a single vertex as an input and generates a single vertex as an output, and perform, with the hardware unit of the graphics processing unit designated for vertex shading, a hull shading operation to generate one or more control points based on one or more of the vertex shaded vertices, wherein the one or more hull shading operations operate on at least one of the one or more vertex shaded vertices to output the one or more control points.
In an example, aspects of this disclosure relate to a method of rendering graphics that includes designating a hardware shading unit of a graphics processing unit to perform first shading operations associated with a first shader stage of a rendering pipeline, switching operational modes of the hardware shading unit upon completion of the first shading operations, and performing, with the hardware shading unit of the graphics processing unit designated to perform the first shading operations, second shading operations associated with a second, different shader stage of the rendering pipeline.
In another example, aspects of this disclosure relate to a graphics processing unit for rendering graphics comprising one or more processors configured to designate a hardware shading unit of the graphics processing unit to perform first shading operations associated with a first shader stage of a rendering pipeline, switch operational modes of the hardware shading unit upon completion of the first shading operations, and perform, with the hardware shading unit of the graphics processing unit designated to perform the first shading operations, second shading operations associated with a second, different shader stage of the rendering pipeline.
In another example, aspects of this disclosure relate to an apparatus for rendering graphics that includes means for designating a hardware shading unit of a graphics processing unit to perform first shading operations associated with a first shader stage of a rendering pipeline, means for switching operational modes of the hardware shading unit upon completion of the first shading operations, and means for performing, with the hardware shading unit of the graphics processing unit designated to perform the first shading operations, second shading operations associated with a second, different shader stage of the rendering pipeline.
In another example, aspects of this disclosure relate to a non-transitory computer-readable medium having instructions stored thereon that, when executed, cause one or more processors to designate a hardware shading unit of a graphics processing unit to perform first shading operations associated with a first shader stage of a rendering pipeline, switch operational modes of the hardware shading unit upon completion of the first shading operations, and perform, with the hardware shading unit of the graphics processing unit designated to perform the first shading operations, second shading operations associated with a second, different shader stage of the rendering pipeline.
The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
In the example of
Examples of CPU 32 include, but are not limited to, a digital signal processor (DSP), general purpose microprocessor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), or other equivalent integrated or discrete logic circuitry. Although CPU 32 and GPU 36 are illustrated as separate units in the example of
In the example shown in
GPU 36 represents one or more dedicated processors for performing graphical operations. That is, for example, GPU 36 may be a dedicated hardware unit having fixed function and programmable components for rendering graphics and executing GPU applications. GPU 36 may also include a DSP, a general purpose microprocessor, an ASIC, an FPGA, or other equivalent integrated or discrete logic circuitry.
GPU 36 also includes GPU memory 38, which may represent on-chip storage or memory used in executing machine or object code. GPU memory 38 may comprise one or more hardware memory registers, each capable of storing a fixed number of digital bits. GPU 36 may be able to read values from or write values to local GPU memory 38 more quickly than reading values from or writing values to storage unit 48, which may be accessed, e.g., over a system bus.
GPU 36 also includes shading units 40. As described in greater detail below, shading units 40 may be configured as a programmable pipeline of processing components. In some examples, shading units 40 may be referred to as “shader processors” or “unified shaders,” and may perform geometry, vertex, pixel, or other shading operations to render graphics. Shading units 40 may include one or more components not specifically shown in
Display unit 42 represents a unit capable of displaying video data, images, text or any other type of data for consumption by a viewer. Display unit 42 may include a liquid-crystal display (LCD), a light emitting diode (LED) display, an organic LED (OLED) display, an active-matrix OLED (AMOLED) display, or the like.
Display buffer unit 44 represents a memory or storage device dedicated to storing data for presentation of imagery, such as photos or video frames, for display unit 42. Display buffer unit 44 may represent a two-dimensional buffer that includes a plurality of storage locations. The number of storage locations within display buffer unit 44 may be substantially similar to the number of pixels to be displayed on display unit 42. For example, if display unit 42 is configured to include 640×480 pixels, display buffer unit 44 may include 640×480 storage locations. Display buffer unit 44 may store the final pixel values for each of the pixels processed by GPU 36. Display unit 42 may retrieve the final pixel values from display buffer unit 44, and display the final image based on the pixel values stored in display buffer unit 44.
User interface unit 46 represents a unit with which a user may interact or otherwise interface to communicate with other units of computing device 30, such as CPU 32. Examples of user interface unit 46 include, but are not limited to, a trackball, a mouse, a keyboard, and other types of input devices. User interface unit 46 may also be a touch screen and may be incorporated as a part of display unit 42.
Storage unit 48 may comprise one or more computer-readable storage media. Examples of storage unit 48 include, but are not limited to, a random access memory (RAM), a read only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer or a processor.
In some example implementations, storage unit 48 may include instructions that cause CPU 32 and/or GPU 36 to perform the functions ascribed to CPU 32 and GPU 36 in this disclosure. Storage unit 48 may, in some examples, be considered as a non-transitory storage medium. The term “non-transitory” may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted to mean that storage unit 48 is non-movable. As one example, storage unit 48 may be removed from computing device 30, and moved to another device. As another example, a storage unit, substantially similar to storage unit 48, may be inserted into computing device 30. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in RAM).
As illustrated in the example of
GPU program 52 may include code written in a high level (HL) programming language, e.g., using an application programming interface (API). Examples of APIs include Open Computing Language (“OpenCL”), Open Graphics Library (“OpenGL”), and DirectX, as developed by Microsoft Corporation. In general, an API includes a predetermined, standardized set of commands that are executed by associated hardware. API commands allow a user to instruct hardware components of a GPU to execute commands without user knowledge as to the specifics of the hardware components.
GPU program 52 may invoke or otherwise include one or more functions provided by GPU driver 50. CPU 32 generally executes the program in which GPU program 52 is embedded and, upon encountering GPU program 52, passes GPU program 52 to GPU driver 50 (e.g., in the form of a command stream). CPU 32 executes GPU driver 50 in this context to process GPU program 52. That is, for example, GPU driver 50 may process GPU program 52 by compiling GPU program 52 into object or machine code executable by GPU 36. This object code is shown in the example of
In some examples, compiler 54 may operate in real-time or near-real-time to compile GPU program 52 during the execution of the program in which GPU program 52 is embedded. For example, compiler 54 generally represents a module that reduces HL instructions defined in accordance with a HL programming language to low-level (LL) instructions of a LL programming language. After compilation, these LL instructions are capable of being executed by specific types of processors or other types of hardware, such as FPGAs, ASICs, and the like (including, e.g., CPU 32 and GPU 36).
LL programming languages are considered low level in the sense that they provide little abstraction, or a lower level of abstraction, from an instruction set architecture of a processor or the other types of hardware. LL languages generally refer to assembly and/or machine languages. Assembly languages sit at a slightly higher level than machine languages, but can generally be converted into machine languages without the use of a compiler or other translation module. Machine languages represent any language that defines instructions that are similar to, if not the same as, those natively executed by the underlying hardware, e.g., a processor, such as x86 machine code (where x86 refers to an instruction set architecture of an x86 processor developed by Intel Corporation).
In any case, compiler 54 may translate HL instructions defined in accordance with a HL programming language into LL instructions supported by the underlying hardware. Compiler 54 removes the abstraction associated with HL programming languages (and APIs) such that the software defined in accordance with these HL programming languages is capable of being more directly executed by the actual underlying hardware.
In the example of
GPU 36 generally receives locally-compiled GPU program 56 (as shown by the dashed-line box labeled “locally-compiled GPU program 56” within GPU 36), whereupon, in some instances, GPU 36 renders an image and outputs the rendered portions of the image to display buffer unit 44. For example, GPU 36 may generate a number of primitives to be displayed at display unit 42. Primitives may include one or more of a line (including curves, splines, etc.), a point, a circle, an ellipse, a polygon (where typically a polygon is defined as a collection of one or more triangles) or any other two-dimensional (2D) primitive. The term “primitive” may also refer to three-dimensional (3D) primitives, such as cubes, cylinders, spheres, cones, pyramids, tori, or the like. Generally, the term “primitive” refers to any basic geometric shape or element capable of being rendered by GPU 36 for display as an image (or frame in the context of video data) via display unit 42.
GPU 36 may transform primitives and other state data (e.g., that defines a color, texture, lighting, camera configuration, or other aspect) of the primitives into a so-called “world space” by applying one or more model transforms (which may also be specified in the state data). Once transformed, GPU 36 may apply a view transform for the active camera (which again may also be specified in the state data defining the camera) to transform the coordinates of the primitives and lights into the camera or eye space. GPU 36 may also perform vertex shading to render the appearance of the primitives in view of any active lights. GPU 36 may perform vertex shading in one or more of the above model, world or view space (although it is commonly performed in the world space).
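The model and view transforms described above can be illustrated with a minimal sketch. The helper names (`mat_mul`, `transform`, `translation`) and the specific translation values are assumptions chosen for illustration and do not come from this disclosure; the sketch simply applies 4×4 row-major matrices to homogeneous points.

```python
# Illustrative sketch only: helper names and matrix values are hypothetical.

def mat_mul(a, b):
    """Multiply two 4x4 row-major matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, point):
    """Apply a 4x4 matrix to a 3D point using homogeneous coordinates (w = 1)."""
    x, y, z = point
    v = (x, y, z, 1.0)
    out = [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]
    return tuple(out[:3])

def translation(tx, ty, tz):
    """Build a translation matrix."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

# Model transform: place the object at x = 2 in world space (assumed value).
model = translation(2.0, 0.0, 0.0)
# View transform: a camera that shifts the world 5 units along -z (assumed value).
view = translation(0.0, 0.0, -5.0)
# Combined transform taking model-space points directly to eye space.
model_view = mat_mul(view, model)

world_point = transform(model, (1.0, 1.0, 0.0))
eye_point = transform(model_view, (1.0, 1.0, 0.0))
```

Composing the matrices once and reusing `model_view` for every vertex is a common design choice, since each vertex then needs a single matrix application.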
Once the primitives are shaded, GPU 36 may perform projections to project the image into a unit cube with extreme points, as one example, at (−1, −1, −1) and (1, 1, 1). This unit cube is commonly referred to as a canonical view volume. After transforming the model from the eye space to the canonical view volume, GPU 36 may perform clipping to remove any primitives that do not at least partially reside within the view volume. In other words, GPU 36 may remove any primitives that are not within the frame of the camera. GPU 36 may then map the coordinates of the primitives from the view volume to the screen space, effectively reducing the 3D coordinates of the primitives to the 2D coordinates of the screen.
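As a rough sketch of the clipping and screen mapping steps described above, the following assumes the canonical view volume spans −1 to 1 on each axis, as in the example given. The function names, the conservative rejection test, and the top-left screen origin are assumptions made for illustration and are not taken from this disclosure.

```python
# Illustrative sketch only: names and conventions are hypothetical.

def trivially_outside(vertices):
    """Return True when every vertex lies beyond the same face of the
    canonical view volume [-1, 1]^3, so the primitive cannot be visible.
    This is a conservative test: it never rejects a visible primitive,
    but it may keep some primitives that are entirely outside."""
    for axis in range(3):
        if all(v[axis] < -1.0 for v in vertices):
            return True
        if all(v[axis] > 1.0 for v in vertices):
            return True
    return False

def to_screen(vertex, width, height):
    """Map x, y from [-1, 1] to pixel coordinates, dropping z.
    Assumes the screen origin is the top-left corner with y pointing down."""
    x, y, _ = vertex
    sx = (x + 1.0) * 0.5 * width
    sy = (1.0 - y) * 0.5 * height
    return (sx, sy)

# A triangle entirely to the right of the view volume is removed.
offscreen = [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
# A triangle straddling the view volume is kept.
straddling = [(-2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 0.5, 0.0)]
center = to_screen((0.0, 0.0, 0.0), 640, 480)
```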
Given the transformed and projected vertices defining the primitives with their associated shading data, GPU 36 may then rasterize the primitives. For example, GPU 36 may compute and set colors for the pixels of the screen covered by the primitives. During rasterization, GPU 36 may apply any textures associated with the primitives (where textures may comprise state data). GPU 36 may also perform a Z-buffer algorithm, also referred to as a depth test, during rasterization to determine whether any of the primitives and/or objects are occluded by any other objects. The Z-buffer algorithm sorts primitives according to their depth so that GPU 36 knows the order in which to draw each primitive to the screen. GPU 36 outputs rendered pixels to display buffer unit 44.
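The depth test mentioned above is often implemented as a per-pixel comparison against a stored depth value. The sketch below shows that common formulation under the assumptions that a smaller depth value is nearer to the camera and that the buffers are simple nested lists; the names `make_buffers` and `write_fragment` are hypothetical.

```python
# Illustrative sketch only: a per-pixel depth comparison, assuming
# smaller z means nearer to the camera. Names are hypothetical.

def make_buffers(width, height, clear_color=(0, 0, 0)):
    """Create a depth buffer cleared to 'infinitely far' and a color buffer."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[clear_color] * width for _ in range(height)]
    return depth, color

def write_fragment(depth, color, x, y, z, rgb):
    """Write a candidate pixel only if it is nearer than what is stored.
    Returns True when the pixel was written, False when it was occluded."""
    if z < depth[y][x]:
        depth[y][x] = z
        color[y][x] = rgb
        return True
    return False

depth, color = make_buffers(4, 4)
first = write_fragment(depth, color, 1, 2, 0.8, (255, 0, 0))    # far red pixel
second = write_fragment(depth, color, 1, 2, 0.3, (0, 255, 0))   # nearer green pixel
third = write_fragment(depth, color, 1, 2, 0.5, (0, 0, 255))    # blue pixel behind green
```

With this formulation the result does not depend on the order in which opaque primitives are drawn, because the nearest value at each pixel always wins.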
Display buffer unit 44 may temporarily store the rendered pixels of the rendered image until the entire image is rendered. Display buffer unit 44 may be considered as an image frame buffer in this context. Display buffer unit 44 may then transmit the rendered image to be displayed on display unit 42. In some alternate examples, GPU 36 may output the rendered portions of the image directly to display unit 42 for display, rather than temporarily storing the image in display buffer unit 44. Display unit 42 may then display the image stored in display buffer unit 44.
To render pixels in the manner described above, GPU 36 may designate shading units 40 to perform a variety of shading operations (as described in greater detail, for example, with respect to
In an example, GPU 36 may designate shading units 40 to perform vertex shading and pixel shading operations. In this example, GPU 36 may lack the resources to designate shading units 40 to perform operations associated with a hull shader, a domain shader, and/or a geometry shader. That is, hardware and/or software restrictions may prevent GPU 36 from designating shading units 40 to perform hull shading, domain shading, and/or geometry shading operations. Accordingly, GPU 36 may be unable to support shader stages associated with APIs that include such functionality.
For example, predecessor GPUs that supported the previous DirectX 9 API (developed by Microsoft, which may include the Direct3D 9 API) may be unable to support the DirectX 10 API (which may include the Direct3D 10 API). That is, at least some of the features of the DirectX 10 API (e.g., certain shader stages) may be unable to be performed using predecessor GPUs. Moreover, GPUs that supported the previous DirectX 9 API and the DirectX 10 API may be unable to support all features of the DirectX 11 API. Such incompatibilities may result in a large number of currently deployed GPUs that may no longer provide support for executing software or other applications that rely on DirectX 10 or DirectX 11. While the example above is described with respect to Microsoft's DirectX family of APIs, similar compatibility issues may be present with other APIs and legacy GPUs 36.
In addition, supporting a relatively longer graphics processing pipeline (e.g., a rendering pipeline having additional shader stages) may require a more complex hardware configuration. For example, introducing a geometry shader stage to the rendering pipeline to perform geometry shading, when implemented by a dedicated one of shading units 40, may result in additional reads and writes to the off-chip memory. That is, GPU 36 may initially perform vertex shading with one of shading units 40 and store vertices to storage unit 48. GPU 36 may also read vertices output by the vertex shader and write the new vertices generated when performing geometry shading by one of shading units 40. Including tessellation stages (e.g., a hull shader stage and domain shader stage) to a rendering pipeline may introduce similar complexities, as described below.
Additional reads and writes to off-chip memory may consume memory bus bandwidth (e.g., a communication channel connecting GPU 36 to storage unit 48) while also potentially increasing the amount of power consumed, considering that the reads and writes each require powering the memory bus and storage unit 48. In this sense, implementing a graphics pipeline with many stages using dedicated shading units 40 for each shader stage may result in less power efficient GPUs. In addition, such GPUs 36 may also perform slower in terms of outputting rendered images due to delay in retrieving data from storage unit 48.
Aspects of this disclosure generally relate to merging the function of one or more of shading units 40, such that one of shading units 40 may perform more than one shading function. For example, typically, GPU 36 may perform a rendering process (which may be referred to as a rendering pipeline having shader stages) by designating shading units 40 to perform particular shading operations, where each of shading units 40 may implement multiple instances of the same shader at the same time. That is, GPU 36 may designate one or more of shading units 40 to perform vertex shading operations, e.g., supporting up to 256 concurrent instances of a vertex shader. GPU 36 may also designate one or more of shading units 40 to perform pixel shading operations, e.g., supporting up to 256 concurrent instances of a pixel shader. These hardware units may store the output from executing one of these shaders to an off-chip memory, such as storage unit 48, until the next designated hardware unit is available to process the output of the previous hardware unit in the graphics processing pipeline.
While aspects of this disclosure may refer to specific hardware shading units in the singular (e.g., a hardware shading unit), it should be understood that such units may actually comprise one or more shading units 40 (more than one shader processor), as well as one or more other components of GPU 36 for performing shading operations. For example, as noted above, GPU 36 may have a plurality of associated shading units 40. GPU 36 may designate more than one of shading units 40 to perform the same shading operations, with each of the shading units 40 configured to perform the techniques of this disclosure for merging shading operations. In general, a hardware shading unit may refer to a set of hardware components invoked by a GPU, such as GPU 36, to perform a particular shading operation.
In one example, aspects of this disclosure include performing vertex shading operations and geometry shading operations with a single hardware shading unit. In another example, aspects of this disclosure include performing vertex shading operations and hull shading operations with a single hardware shading unit. In still another example, aspects of this disclosure include performing domain shading operations and geometry shading operations with a single hardware shading unit. Aspects of this disclosure also relate to the manner in which a hardware shading unit transitions between shading operations. That is, aspects of this disclosure relate to transitioning between performing a first shading operation with the hardware shading unit and performing a second shading operation with the same hardware shading unit.
For example, according to aspects of this disclosure, GPU 36 may perform, with a shading unit 40 designated to perform vertex shading operations, vertex shading operations to shade input vertices so as to output vertex shaded vertices. In this example, shading unit 40 may be configured with an interface that receives a single vertex as an input and generates a single vertex as an output. In addition, GPU 36 may perform, with the same shading unit 40, a geometry shading operation to generate one or more new vertices based on one or more of the vertex shaded vertices. The geometry shading operation may operate on at least one of the one or more vertex shaded vertices to output the one or more new vertices. Again, while described with respect to a single shading unit 40, these techniques may be concurrently implemented by a plurality of shading units 40 of GPU 36.
Certain APIs may require that a shading unit 40 designated to perform vertex shading operations implements or adheres to a 1:1 interface, which receives a single vertex as an input and generates a single vertex as an output. In contrast, a shading unit 40 designated to perform geometry shading operations may implement or adhere to a 1:N interface, which receives one or more vertices as an input and generates one or more (and often many, hence the use of “N” above) vertices as outputs.
According to aspects of this disclosure, GPU 36 may leverage the 1:1 interface of a shading unit 40 designated to perform vertex shading operations to emulate this 1:N geometry shader interface by invoking multiple instances of a geometry shader program. GPU 36 may concurrently execute each of these geometry shader programs to generate one of the new vertices that result from performing the geometry shader operation. That is, shading units 40 may be programmable using a high-level shading language (HLSL) (e.g., with a graphics rendering API) such that shading units 40 may concurrently execute multiple instances of what is commonly referred to as a “shader program.” These shader programs may be referred to as “fibers” or “threads” (both of which may refer to a stream of instructions that form a program or thread of execution). According to aspects of this disclosure and as described in greater detail below, GPU 36 may execute multiple instances of a geometry shader program using a hardware shading unit designated for vertex shading operations. GPU 36 may append the geometry shader instructions to the vertex shader instructions so that the same shading unit 40 executes both shaders, e.g., the vertex shader and the geometry shader, in sequence.
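The idea of emulating a 1:N geometry shader interface with several instances that each output a single vertex can be sketched as follows. This is a simplified software model rather than the hardware behavior: the fibers run one after another instead of concurrently, and the shader bodies, the `local_memory` list, and the function names are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: a sequential software model of merged
# vertex/geometry shading. All names and shader bodies are hypothetical.

def vertex_shader(vertex):
    """1:1 stage: one vertex in, one vertex out (here, a shift along x)."""
    x, y, z = vertex
    return (x + 1.0, y, z)

def geometry_shader(shaded_vertices, out_index):
    """One instance of a geometry shader program. It may read every
    vertex-shaded vertex of the primitive, but emits only the single
    output vertex selected by out_index. In this example the outputs are
    the shaded vertices followed by their centroid."""
    count = len(shaded_vertices)
    if out_index < count:
        return shaded_vertices[out_index]
    return tuple(sum(v[axis] for v in shaded_vertices) / count
                 for axis in range(3))

def run_merged(input_vertices, num_outputs):
    """Run the vertex shader and then the appended geometry shader in
    each fiber, so that each fiber produces exactly one output vertex."""
    local_memory = [None] * len(input_vertices)
    # Vertex shading phase: each fiber that has an input vertex shades it.
    for fiber in range(num_outputs):
        if fiber < len(input_vertices):
            local_memory[fiber] = vertex_shader(input_vertices[fiber])
    # Geometry shading phase: each fiber emits one of the new vertices.
    return [geometry_shader(local_memory, fiber) for fiber in range(num_outputs)]

triangle = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
new_vertices = run_merged(triangle, 4)
```

In this model the number of fibers equals the number of output vertices, so a fiber whose index exceeds the input vertex count skips the vertex shading phase and only takes part in geometry shading.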
In another example, according to aspects of this disclosure, GPU 36 may perform, with a hardware shading unit designated to perform vertex shading operations, vertex shading operations to shade input vertices so as to output vertex shaded vertices. The hardware shading unit may adhere to an interface that receives a single vertex as an input and generates a single vertex as an output. In addition, GPU 36 may perform, with the same hardware shading unit designated for performing vertex shading operations, one or more tessellation operations (e.g., hull shading operations and/or domain shading operations) to generate one or more new vertices based on one or more of the vertex shaded vertices. The one or more tessellation operations may operate on at least one of the one or more vertex shaded vertices to output the one or more new vertices.
For example, in addition to the shader stages described above, some graphics rendering pipelines may also include a hull shader stage, a tessellator stage, and a domain shader stage. In general, the hull shader stage, tessellator stage, and domain shader stage are included to accommodate hardware tessellation. That is, the hull shader stage, tessellator stage, and domain shader stage are included to accommodate tessellation by GPU 36, rather than being performed by a software application being executed, for example, by CPU 32.
According to aspects of this disclosure, GPU 36 may perform vertex shading and tessellation operations with the same shading unit 40. For example, GPU 36 may perform vertex shading and tessellation operations in two passes. According to aspects of this disclosure and as described in greater detail below, GPU 36 may store a variety of values to enable transitions between the different shading operations.
In an example, in a first pass, GPU 36 may designate one or more shading units 40 to perform vertex shading and hull shading operations. In this example, GPU 36 may append hull shader instructions to vertex shader instructions. Accordingly, the same shading unit 40 executes the vertex shading and hull shader instructions in sequence.
In a second pass, GPU 36 may designate the one or more shading units 40 to perform domain shading and geometry shading operations. In this example, GPU 36 may append domain shader instructions to the geometry shader instructions. Accordingly, the same shading unit 40 executes the domain shading and geometry shading operations in sequence. By performing multiple shading operations in multiple passes, GPU 36 may use the same shading hardware to emulate a GPU having additional shading capabilities.
Aspects of this disclosure also relate to the manner in which GPU 36 transitions between shading operations. For example, aspects of this disclosure relate to the manner in which shading operations are patched together, so that the operations are executed in sequence by the same hardware shading unit.
In an example, according to aspects of this disclosure, GPU 36 may designate one or more shading units 40 to perform first shading operations associated with a first shader stage of a rendering pipeline. GPU 36 may switch operational modes of shading unit 40 upon completion of the first shading operations. GPU 36 may then perform, with the same shading unit 40 designated to perform the first shading operations, second shading operations associated with a second, different shader stage of the rendering pipeline.
According to some examples, GPU 36 may patch shading operations together using a plurality of modes, with each mode having a particular set of associated shading operations. For example, a first mode may indicate that a draw call includes only vertex shading operations. In this example, upon executing the draw call, GPU 36 may designate one or more shading units 40 to perform vertex shading operations in accordance with the mode information. In addition, a second mode may indicate that a draw call includes both vertex shading and geometry shading operations. In this example, upon executing the draw call, GPU 36 may designate one or more shading units 40 to perform vertex shading operations. In addition, according to aspects of this disclosure, GPU 36 may append geometry shader instructions to vertex shader instructions, such that the same shading units execute both vertex and geometry shading operations. Additional modes may be used to indicate other combinations of shaders, as described in greater detail below.
In some examples, GPU driver 50 may generate the mode information used by GPU 36. According to aspects of this disclosure, the different shaders (e.g., vertex shading operations, geometry shading operations, hull shading operations, domain shading operations, and the like) do not have to be compiled in a particular manner in order to be executed in sequence by the same shading unit 40. Rather, each shader may be independently compiled (without reference to any other shader) and patched together at draw time by GPU 36. That is, upon executing a draw call, GPU 36 may determine the mode associated with the draw call and patch compiled shaders together accordingly.
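The draw-time patching described above can be sketched in a few lines. This is a minimal illustration rather than the actual behavior of GPU 36 or GPU driver 50: the mode names, the `patch_for_draw` function, the instruction strings, and the `PATCH` placeholder are all hypothetical and chosen only to show independently compiled shaders being joined according to a mode.

```python
from enum import Enum


class DrawMode(Enum):
    # Hypothetical names for the two modes described in the text.
    VS_ONLY = "vs_only"   # draw call includes only vertex shading
    VS_GS = "vs_gs"       # draw call includes vertex and geometry shading


# Stages that run on the vertex-designated shading unit for each mode.
STAGES_FOR_MODE = {
    DrawMode.VS_ONLY: ["vs"],
    DrawMode.VS_GS: ["vs", "gs"],
}


def patch_for_draw(mode, compiled):
    """Concatenate independently compiled shaders in stage order.

    `compiled` maps a stage name to its instruction list. Each list was
    produced without reference to any other shader; a placeholder
    "PATCH" entry marks where transition code would sit between stages.
    """
    program = []
    for i, stage in enumerate(STAGES_FOR_MODE[mode]):
        if i > 0:
            program.append("PATCH")  # stand-in for the transition code
        program.extend(compiled[stage])
    return program


compiled_shaders = {
    "vs": ["vs_op0", "vs_op1"],
    "gs": ["gs_op0"],
}
print(patch_for_draw(DrawMode.VS_ONLY, compiled_shaders))
print(patch_for_draw(DrawMode.VS_GS, compiled_shaders))
```

Because the stage lists are looked up per mode, neither shader needs to know at compile time whether the other will be present in a given draw call.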
The techniques of this disclosure may enable a GPU (such as GPU 36) having a limited number of shading units 40 for performing shading operations to emulate a GPU having a greater number of shading units 40. For example, while GPU 36 may be prevented from designating shading units 40 to perform more than two shading operations (e.g., vertex shading operations and pixel shading operations), the techniques of this disclosure may enable GPU 36 to perform additional shading operations (e.g., geometry shading operations, hull shading operations, and/or domain shading operations) without reconfiguring shading units 40. That is, the techniques may allow shading units 40 to adhere to input/output constraints of certain shader stages, while performing other shading operations.
Moreover, by performing multiple shading operations with the same shading units 40, the techniques may reduce memory bus bandwidth consumption. For example, in the case of vertex shading being performed with other shading operations (e.g., geometry shading), shading units 40 used for vertex shading do not need to store the vertex shading results to an off-chip memory (such as storage unit 48) prior to performing the other shader operations. Rather, vertex shading results may be stored to GPU memory 38 and immediately used for geometry shading operations.
In this manner, the techniques may reduce memory bus bandwidth consumption in comparison to GPUs having additional shading units 40, which may reduce power consumption. The techniques may therefore promote more power efficient GPUs that utilize less power than GPUs having additional hardware shader units. Accordingly, in some examples, the techniques may be deployed in power-limited devices, such as mobile devices, laptop computers and any other type of device that does not have a constant dedicated supply of power.
It should be understood that computing device 30 may include additional modules or units not shown in
Graphics processing pipeline 80 generally includes programmable stages (e.g., illustrated with rounded corners) and fixed function stages (e.g., illustrated with squared corners). For example, graphics rendering operations associated with certain stages of graphics rendering pipeline 80 are generally performed by a programmable shader processor, such as one of shading units 40, while other graphics rendering operations associated with other stages of graphics rendering pipeline 80 are generally performed by non-programmable, fixed function hardware units associated with GPU 36. Graphics rendering stages performed by shading units 40 may generally be referred to as “programmable” stages, while stages performed by fixed function units may generally be referred to as fixed function stages.
Input assembler stage 82 is shown in the example of
Vertex shader stage 84 may process the received vertex data and attributes. For example, vertex shader stage 84 may perform per-vertex processing such as transformations, skinning, vertex displacement, and calculating per-vertex material attributes. In some examples, vertex shader stage 84 may generate texture coordinates, vertex color, vertex lighting, fog factors, and the like. Vertex shader stage 84 generally takes a single input vertex and outputs a single, processed output vertex.
Geometry shader stage 86 may receive a primitive defined by the vertex data (e.g., three vertices for a triangle, two vertices for a line, or a single vertex for a point) and further process the primitive. For example, geometry shader stage 86 may perform per-primitive processing such as silhouette-edge detection and shadow volume extrusion, among other possible processing operations. Accordingly, geometry shader stage 86 may receive one primitive as an input (which may include one or more vertices) and output zero, one, or multiple primitives (which again may include one or more vertices). The output primitive may contain more data than may be possible without geometry shader stage 86. The total amount of output data may be equal to the vertex size multiplied by the vertex count, and may be limited per invocation. The stream output from geometry shader stage 86 may allow primitives reaching this stage to be stored to the off-chip memory, such as storage unit 48. The stream output is typically tied to geometry shader stage 86, and both may be programmed together (e.g., using an API).
Rasterizer stage 88 is typically a fixed function stage that is responsible for clipping primitives and preparing primitives for pixel shader stage 90. For example, rasterizer stage 88 may perform clipping (including custom clip boundaries), perspective divide, viewport/scissor selection and implementation, render target selection and primitive setup. In this way, rasterizer stage 88 may generate a number of fragments for shading by pixel shader stage 90.
Pixel shader stage 90 receives fragments from rasterizer stage 88 and generates per-pixel data, such as color. Pixel shader stage 90 may also perform per-pixel processing such as texture blending and lighting model computation. Accordingly, pixel shader stage 90 may receive one pixel as an input and may output one pixel at the same relative position (or a zero value for the pixel).
Output merger stage 92 is generally responsible for combining various types of output data (such as pixel shader values, depth and stencil information) to generate a final result. For example, output merger stage 92 may perform fixed function blend, depth, and/or stencil operations for a render target (pixel position). While described above in general terms with respect to vertex shader stage 84, geometry shader stage 86, and pixel shader stage 90, each of the foregoing descriptions may refer to one or more shading units (such as shading units 40) designated by a GPU to perform the respective shading operations.
Certain GPUs may be unable to support all of the shader stages shown in
In addition, in some examples, introducing geometry shader stage 86 to the pipeline may result in additional reads and writes to storage unit 48, relative to a graphics processing pipeline that does not include geometry shader stage 86. For example, as noted above, vertex shader stage 84 may write vertices out to off-chip memory, such as storage unit 48. Geometry shader stage 86 may read these vertices (the vertices output by vertex shader stage 84) and write the new vertices, which are then pixel shaded. These additional reads and writes to storage unit 48 may consume memory bus bandwidth while also potentially increasing the amount of power consumed. In this sense, implementing a graphics processing pipeline that includes each of the vertex shader stage 84, geometry shader stage 86, and pixel shader stage 90 may result in less power efficient GPUs that may also be slower in terms of outputting rendered images due to delay in retrieving data from storage unit 48.
As noted above, aspects of this disclosure generally relate to merging the function of one or more of shading units 40, such that a shading unit 40 designated for a particular shading operation may perform more than one shading operation. As described in greater detail below, in some examples, one shading unit 40 may be designated for performing vertex shading operations associated with vertex shader stage 84. According to aspects of this disclosure, the same shading unit 40 may also be implemented to perform geometry shading operations associated with geometry shader stage 86. That is, GPU 36 may invoke the shading unit 40 to perform vertex shading operations, but may also implement the shading unit 40 to perform geometry shading operations without re-designating the shading unit 40 to perform the geometry shading task.
For example, vertex shader stage 100 represents one or more units (such as shading units 40) that perform vertex shading operations. That is, vertex shader stage 100 may include components that are invoked by GPU 36 to perform vertex shading operations. For example, vertex shader stage 100 may receive a vertex as an input and translate the input vertex from the three-dimensional (3D) model space to a two-dimensional (2D) coordinate in screen space. Vertex shader stage 100 may then output the translated version of the vertex (which may be referred to as the “translated vertex”). Vertex shader stage 100 does not ordinarily create new vertices, but operates on one vertex at a time. As a result, vertex shader stage 100 may be referred to as a one-to-one (1:1) stage, in that vertex shader stage 100 receives a single input vertex and outputs a single output vertex.
Geometry shader stage 102 represents one or more units (such as shading units 40) that perform geometry shading operations. That is, geometry shader stage 102 may include components that are invoked by GPU 36 to perform geometry shading operations. For example, geometry shader stage 102 may be useful for performing a wide variety of operations, such as single pass rendering to a cube map, point sprite generation, and the like. Typically, geometry shader stage 102 receives primitives composed of one or more translated vertices, which have been vertex shaded by vertex shader stage 100. Geometry shader stage 102 performs geometry shading operations to create new vertices that may form new primitives (or possibly transform the input primitive to a new type of primitive having additional new vertices).
For example, geometry shader stage 102 typically receives a primitive defined by one or more translated vertices and generates one or more new vertices based on the received primitive. Geometry shader stage 102 then outputs the new vertices (which may form one or more new primitives). As a result, geometry shader stage 102 may be referred to as a one-to-many (1:N) or even a many-to-many (N:N) stage, in that geometry shader stage 102 receives one or more translated vertices and generates a number of new vertices.
While described as being one-to-many or even many-to-many, geometry shader stage 102 may also, in some instances, not output any new vertices or only output a single new vertex. In this respect, the techniques should not be limited to only those geometry shaders that output many vertices in every instance, but may be generally implemented with respect to any geometry shader stage 102 that may output zero, one or many new vertices, as will be explained in more detail below.
The output of geometry shader stage 102 may be stored for additional geometry shading (e.g., during stream out 104). The output of geometry shader stage 102 may also be output to a rasterizer that rasterizes the new vertices (and the translated vertices) to generate a raster image comprised of pixels.
The pixels from geometry shader stage 102 may also be passed to pixel shader stage 106. Pixel shader stage 106 (which may also be referred to as a fragment shader) may compute color and other attributes of each pixel, performing a wide variety of operations to produce a shaded pixel. The shaded pixels may be merged with a depth map and other post shading operations may be performed to generate an output image for display via a display device, such as a computer monitor, television, or other type of display device.
The shader stages shown in
For example, upon vertex shading operations being invoked by GPU 36, VS/GS stage 110 may perform both vertex shading operations and geometry shading operations. That is, merged VS/GS stage 110 may include the same set of shading units 40 for performing the operations described above with respect to vertex shader stage 100 and for performing the operations described above with respect to geometry shader stage 102.
However, because GPU 36 initially invokes each shading unit 40 as a vertex shading unit, components of GPU 36 may be configured to receive data from the vertex shading unit in a particular format, e.g., adhering to a 1:1 input/output interface. For example, GPU 36 may allocate a single entry in a cache (e.g., a vertex parameter cache, as described in greater detail below) to store the output from a shading unit 40 for a shaded vertex. GPU 36 may also perform some rasterization operations based on the manner in which the shading unit 40 is invoked. As described in greater detail below, aspects of this disclosure allow GPU 36 to perform geometry shading operations with the same shading unit as the vertex shading operations, while still adhering to the appropriate interface.
In some instances, the geometry shader stage 102 may primarily be used for low amplification of data (e.g., point-sprite generation). Such operations may require relatively low ALU usage per geometry shader invocation. Accordingly, ALUs of shading units 40 may not be fully utilized during geometry shader stage 102. According to aspects of this disclosure, geometry shader stage 102 may be appended to vertex shader stage 100 to form merged VS/GS stage 110, which may be invoked as vertex shader stage 100 in GPU architecture. Invoking the merged VS/GS stage 110 in the manner described above may increase ALU utilization by allowing both vertex shading and geometry shading operations to be performed by the same processing units.
To enable merged VS/GS stage 110, GPU 36 may perform functions for transitioning between vertex shading operations (a 1:1 stage) and geometry shading operations (a 1:N stage), as described in greater detail with respect to the example shown in
In the example of
After executing the vertex shading operations, GPU 36 may store the shaded vertices to local memory resources. For example, GPU 36 may export the vertex shader output to a position cache (e.g., of GPU memory 38), along with “cut” information (if any) and a streamid. The vertex shading operations and geometry shading operations may be separated by a VS END instruction. Accordingly, after executing the VS END instruction and completing the vertex shading operations, one or more shading units 40 designated to perform the vertex shading operations each begin performing geometry shading operations.
That is, according to aspects of this disclosure, the same shading unit 40 designated to perform vertex shading operations also performs geometry shading operations. For example, GPU 36 may change state to geometry shader specific resources (e.g., geometry shader constants, texture offsets, and the like) by changing one or more resource pointers. GPU 36 may perform this state change according to a mode (draw mode) assigned to the shading operations.
In some examples, GPU 36 may set a draw mode when executing a draw call. The draw mode may indicate which shading operations are associated with the draw call. In an example for purposes of illustration, a draw mode of 0 may indicate that the draw call includes vertex shading operations only. A draw mode of 1 may indicate that the draw call includes both vertex shading operations and geometry shading operations. Other draw modes are also possible, as described in greater detail below. Table 1 provides an example mode table having two modes:
In the example of Table 1 above, “flow” indicates the flow of operations (as executed by GPU 36) associated with the respective modes. For example, mode 0 includes vertex shading (VS) and pixel shading (PS) operations. Accordingly, GPU 36 may designate shading units 40 to perform vertex shading operations and pixel shading operations upon executing a mode 0 draw call. Mode 1 of Table 1 includes vertex shading and pixel shading operations, as well as geometry shading (GS) operations.
Accordingly, GPU 36 may designate shading units 40 to perform vertex shading operations and pixel shading operations. However, GPU 36 may also append geometry shader instructions to vertex shader instructions, so that geometry shader operations are executed by the same shading units 40 responsible for executing the vertex shader operations. The “misc” bits are reserved for variables (e.g., rel_primID, rel_vertex, GsInstance, Gsoutvertex) that are used to enable the same shading unit 40 to execute multiple different shaders in succession.
In the example of
The eight columns of the table shown in
In the example shown in
As shown in the example of
The seventh and eighth instances of the geometry shader operation are “killed” or terminated because the geometry shader operation only generates six new vertices and the outIDs of the seventh and eighth instances of the geometry shader operation do not correspond to any of the six new vertices. Thus, shading unit 40 terminates execution of the seventh and eighth instances of the geometry shader operation upon determining that there is no corresponding vertex associated with these instances of the geometry shader operation.
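The way multiple geometry shader instances can preserve a 1:1 interface may be illustrated with a small simulation. It is only a sketch under stated assumptions: `toy_geometry_shader` is an invented shader that happens to emit six vertices for a triangle, eight fibers are allocated, and each fiber is modeled as running the whole geometry shader and keeping only the vertex that matches its outID. The actual hardware behavior may differ.

```python
def toy_geometry_shader(primitive):
    """Hypothetical geometry shader: emits each input vertex plus a shifted copy.

    For a three-vertex triangle this yields six new vertices.
    """
    new_vertices = []
    for x, y in primitive:
        new_vertices.append((x, y))
        new_vertices.append((x + 1, y))
    return new_vertices


def run_gs_fibers(primitive, fiber_count):
    """Run one geometry shader instance per fiber.

    Each fiber keeps only the vertex whose position in the output matches
    its outID, so every fiber still produces at most one vertex (1:1).
    A fiber whose outID has no matching vertex is "killed" (None).
    """
    results = []
    for out_id in range(fiber_count):
        emitted = toy_geometry_shader(primitive)
        results.append(emitted[out_id] if out_id < len(emitted) else None)
    return results


triangle = [(0, 0), (4, 0), (0, 4)]
outputs = run_gs_fibers(triangle, fiber_count=8)
killed = [out_id for out_id, v in enumerate(outputs) if v is None]
print(outputs)
print("killed outIDs:", killed)
```

With zero-based outIDs, the killed instances are outIDs 6 and 7, which correspond to the seventh and eighth instances discussed above.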
Table 2, shown below, illustrates several parameters that may be maintained by GPU 36 to perform vertex shading operations and geometry shading operations.
Certain parameters shown in Table 2 (e.g., uv_msb, Rel_patchid) are not used for VS/GS operations, and are described in greater detail below. In the example of Table 2, index indicates the relative index of the vertices. PrimitiveID indicates the primitive ID used during the geometry shading operations to identify the primitive of the associated vertices, and may be a system generated value (e.g., generated by one or more hardware components of GPU 36). As noted above, Misc indicates reserved cache values for performing the GS operations after the VS operations. For example, Table 3, shown below, illustrates parameter values when performing the vertex shading and geometry shading operations described above with respect to
While a number of fibers (e.g., instructions) are allocated for performing the vertex shading and geometry shading operations, in some instances, GPU 36 may only execute a sub-set of the fibers. For example, GPU 36 may determine whether instructions are valid (valid_as_input shown in Table 3 above) before executing the instructions with shading units 40. Because only three of the allocated fibers are used to generate shaded vertices, GPU 36 may not execute the remaining fibers (fibers 3-7 in Table 3 above) when performing vertex shading operations, which may conserve power. As described in greater detail below, GPU 36 may determine which fibers to execute based on a mask (e.g., cov_mask_1 in
Certain APIs (e.g., the DirectX 10 API) provide for a so-called “stream out” from the geometry shader stage, where the stream out refers to outputting the new vertices from the geometry shader to a memory, such as storage unit 48, so that these new vertices may be input back into the geometry shader.
The techniques may provide support for this stream out functionality by enabling the hardware unit to output the new vertices that result from performing the geometry shader operation to storage unit 48. The new vertices output via this stream out are specified in the expected geometry shader format, rather than in the format expected by the rasterizer. The hardware unit may retrieve these new vertices and continue to implement an existing geometry shader operation, or a new geometry shader operation with respect to these vertices, which may be referred to as “stream out vertices” in this context. In this way, the techniques may enable a GPU, such as GPU 36, having a relatively limited number of shading units 40 to emulate a GPU having more shading units.
In the example shown in
The merged VS/GS hardware unit then performs vertex shading operations (142). Following the vertex shading operations, the merged VS/GS hardware shading unit may write the contents of general purpose registers (GPRs) (e.g., primitive vertices from the vertex shading operations) to local memory, such as GPU memory 38. The merged VS/GS hardware shading unit may then switch to GS texture and constant offsets (146) and a GS program counter (148), as described in greater detail below with respect to
The merged VS/GS hardware shading unit may read the contents of local memory, such as the primitive vertices from the vertex shading operations, and perform geometry shading operations (150). The merged VS/GS hardware shading unit may output one vertex attribute to a vertex parameter cache (VPC), as well as an indication of the position of the geometry shaded vertices, a stream_id, any cut indications, and any interpreted values to a position cache.
In the example shown in
For example, the hardware shading unit may write the vertex data from the vertex shading operations to local GPU memory, so that the shaded vertices are available when performing geometry shading operations. The hardware shading unit (or another component of the GPU) then executes a change mask (CHMSK) instruction that switches the resources of the hardware shading unit for geometry shading operations. For example, executing the CHMSK instruction may cause the hardware shading unit to determine which mode is currently being executed.
With respect to the Table 2 above, executing CHMSK may also cause the hardware shading unit to determine which shader stages are valid (e.g., vs_valid, gs_valid, and the like). As noted above, GPU 36 may allocate a number of fibers for performing the vertex shading and geometry shading operations. However, upon executing CHMSK, GPU 36 may only execute a sub-set of the fibers. For example, GPU 36 may determine whether instructions are valid before executing the instructions with shading units 40. GPU 36 may not execute fibers that are not valid (e.g., do not generate a shaded vertex), which may conserve power.
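The selection of valid fibers can be pictured as a bitmask test. The sketch below is illustrative only: the mask encoding (one bit per fiber, lowest bit for fiber 0) and the helper names are assumptions, and the counts of eight allocated fibers, three vertex shading fibers, and six geometry shading fibers follow the example discussed earlier.

```python
def fibers_to_execute(valid_mask, fiber_count):
    """Return indices of fibers whose bit is set in valid_mask."""
    return [i for i in range(fiber_count) if (valid_mask >> i) & 1]


def make_mask(valid_count):
    """Build a mask with the lowest `valid_count` bits set."""
    return (1 << valid_count) - 1


FIBER_COUNT = 8
vs_mask = make_mask(3)  # three input vertices: fibers 0-2 do vertex shading
gs_mask = make_mask(6)  # six output vertices: fibers 0-5 do geometry shading

print(fibers_to_execute(vs_mask, FIBER_COUNT))
print(fibers_to_execute(gs_mask, FIBER_COUNT))
```

Switching from `vs_mask` to `gs_mask` between the two shader stages is the role this sketch assigns to the mask change; fibers absent from the active mask are simply never scheduled.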
The hardware shading unit also executes a change shader (CHSH) instruction to switch a program counter (PC) to the appropriate state offsets for performing geometry shading operations. As described in greater detail below, this patch code (contained in the second dashed box from top to bottom, which may correspond to steps 144-148 in the example of
After executing the patch code, the hardware shader unit ceases vertex shading operations and performs geometry shading operations (contained in the third dashed box from top to bottom, corresponding to step 150 in the example of
According to aspects of this disclosure, each of the shaders may be independently compiled without respect to other shaders. For example, the shaders may be independently compiled without knowledge of when other shaders will be executed. After compilation, GPU 36 may patch together the shaders using the patch code shown in
The patch code described above may be added to compiled shaders by a driver for GPU 36, such as GPU driver 50. For example, GPU driver 50 determines which shaders are required for each draw call. GPU driver 50 may attach the patch code shown in
In this way, GPU 36 may patch shading operations together using a plurality of modes, with each mode having a particular set of associated shading operations. Such techniques may enable GPU 36 to perform additional shading operations (e.g., geometry shading operations, hull shading operations, and/or domain shading operations) without reconfiguring shading units 40. That is, the techniques may allow shading units 40 to adhere to input/output constraints of certain shader stages, while performing other shading operations.
In the example of
In an example for purposes of illustration, a DirectX 10 dispatch mechanism may be implemented using the graphics processing unit 178 shown in
At the output of VPC 182, PC 184 will generate primitive connectivity based on the GS output primitive type. For example, the first output vertex from a GS (of VS/GS 180) may typically include a “cut” bit in the position cache, which may indicate completion of a primitive (strip) before this vertex. PC 184 also sends this connectivity information for complete primitives to VPC 182, along with a streamid, so that VPC 182 can stream out GS outputs to buffers 204 tied to a given stream. If there is a partial primitive between full primitives in GS 180, such a partial primitive is marked as PRIM_AMP_DEAD for GRAS 188 to drop the primitive. PC 184 also sends dead primitive types to VPC 182 to de-allocate a parameter cache for such a primitive.
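The role of the cut bit in building connectivity can be sketched for a triangle strip output type. This is a simplified model and not the behavior of PC 184: it assumes a cut bit on a vertex starts a new strip at that vertex, treats any strip with fewer than three vertices as a partial (dead) primitive, and ignores winding order. The function names are hypothetical.

```python
def split_strips(cut_bits):
    """Group vertex indices into strips.

    cut_bits[i] is True when vertex i starts a new strip, i.e. the strip
    before it is complete.
    """
    strips = []
    current = []
    for index, cut in enumerate(cut_bits):
        if cut and current:
            strips.append(current)
            current = []
        current.append(index)
    if current:
        strips.append(current)
    return strips


def triangles_from_strips(strips):
    """Expand triangle strips into triangles; collect partial strips as dead.

    Winding-order alternation within a strip is ignored in this sketch.
    """
    triangles, dead = [], []
    for strip in strips:
        if len(strip) < 3:
            dead.append(strip)  # too few vertices to form a triangle
            continue
        for i in range(len(strip) - 2):
            triangles.append((strip[i], strip[i + 1], strip[i + 2]))
    return triangles, dead


cuts = [True, False, False, False, True, False, True, False, False]
strips = split_strips(cuts)
triangles, dead = triangles_from_strips(strips)
print(strips)
print(triangles)
print(dead)
```

In this example the two-vertex strip between the two complete strips plays the part of the partial primitive that would be dropped.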
Based on maxoutputvertexcount, a GPU driver (such as GPU driver 50 shown in
A high level sequencer (HLSQ) that receives the draw call of this type may check which shader processor's local memory (LM) has enough storage for GS_LM_SIZE (e.g., possibly using a round robin approach). The HLSQ may maintain the start base address of such an allocation, as well as the address of any read or write to local memory by an allocated wave. The HLSQ may also add a computed offset within the allocated memory to the base address when writing to local memory.
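The local memory bookkeeping described above can be sketched as a round-robin allocator. Everything here is an assumption made for illustration: the class and method names, the simple bump-style allocation with no freeing, and the unit of size are not taken from the HLSQ itself.

```python
class LocalMemoryAllocator:
    """Toy round-robin allocator over several shader processors' local memory."""

    def __init__(self, processor_count, lm_size):
        self.free = [lm_size] * processor_count   # free units per processor
        self.next_base = [0] * processor_count    # next free address per processor
        self.cursor = 0                           # round-robin start position

    def allocate(self, gs_lm_size):
        """Return (processor, base_address), or None if nothing fits."""
        count = len(self.free)
        for step in range(count):
            sp = (self.cursor + step) % count
            if self.free[sp] >= gs_lm_size:
                base = self.next_base[sp]
                self.free[sp] -= gs_lm_size
                self.next_base[sp] += gs_lm_size
                self.cursor = (sp + 1) % count
                return sp, base
        return None


def local_address(base, offset):
    """Address of a read or write: allocation base plus computed offset."""
    return base + offset


allocator = LocalMemoryAllocator(processor_count=2, lm_size=100)
first = allocator.allocate(60)
second = allocator.allocate(60)
third = allocator.allocate(60)
fourth = allocator.allocate(40)
print(first, second, third, fourth)
print(local_address(fourth[1], 8))
```

The third request fails because neither processor has 60 units left, while the smaller fourth request fits in the remainder of the first processor's local memory.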
Thus, according to aspects of this disclosure, the relationship between input and output is not 1:1 (as would be typical for a shading unit designated to perform vertex shading operations) for VS/GS 180. Rather, the GS may output one or more vertices from each input primitive. In addition, the number of vertices that are output by the GS is dynamic, and may vary from one to an API-imposed maximum GS output (e.g., 1024 double words (dwords), which may be equivalent to an output maximum of 1024 vertices).
That is, the GS may produce a minimum of one vertex and a maximum of 1024 vertices, and the overall output from the GS may be 1024 dwords. The GS may declare at compile time a maximum number of output vertices from the GS using the variable dcl_maxoutputvertexcount. However, the actual number of output vertices may not be known at the time GPU 36 executes the GS. Rather, the declaration dcl_maxoutputvertexcount may only be required as a parameter for the GS.
The GS may also declare the variable instancecount for the number of GS instances (operations) to be invoked per input primitive. This declaration may act as an outer loop for the GS invocation (identifying the maximum number of geometry shader instances). The maximum instancecount may be set to 32, although other values may also be used. Accordingly, the GS has access to a variable GSInstanceID in the geometry shader operations, which indicates which instance a given GS is working on. Each of the GS instances can output up to 1024 dwords, and each may have dcl_maxoutputvertexcount as a number of maximum output vertices. In addition, each GS instance may be independent of other GS instances.
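The limits discussed above can be expressed as a simple check. The sketch assumes that the output size is the declared maximum vertex count multiplied by the vertex size in dwords, and that this product must fit the 1024 dword limit for each instance; the function names and the dictionary used to represent an instance are hypothetical.

```python
MAX_OUTPUT_DWORDS = 1024   # per-instance output limit stated in the text
MAX_INSTANCE_COUNT = 32    # example maximum instancecount from the text


def check_gs_declaration(max_output_vertex_count, vertex_size_dwords, instance_count):
    """Return True if the declaration fits the limits described above."""
    if not 1 <= instance_count <= MAX_INSTANCE_COUNT:
        return False
    return max_output_vertex_count * vertex_size_dwords <= MAX_OUTPUT_DWORDS


def invoke_instances(instance_count):
    """Outer loop over GS instances; each sees its own GSInstanceID."""
    return [{"GSInstanceID": i} for i in range(instance_count)]


print(check_gs_declaration(1024, 1, 1))    # one dword per vertex
print(check_gs_declaration(64, 16, 32))    # 64 vertices of 16 dwords each
print(check_gs_declaration(65, 16, 1))     # exceeds the dword limit
print(invoke_instances(3))
```

Under these assumptions, a one-dword vertex is the only case in which 1024 output vertices fit, which is consistent with the equivalence between 1024 dwords and 1024 vertices noted above.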
The input primitive type, which GPU 36 may declare at the input of the GS, may be a point, a line, a triangle, a line with adjacency, a triangle with adjacency, and patch1-32. A triangle with adjacency may be a new feature for certain APIs, such as DirectX 10. In addition, a patch1-32 may be a further enhancement added for the DirectX 11 API. The output primitive type from the GS can be a point, line strip, or a triangle strip. The output of the GS may go to one of four streams that may be declared in the GS, and the GS may declare how many streams are used. In general, a “stream” refers to shaded data that is either stored (e.g., to a memory buffer) or sent to another unit of the GPU, such as the rasterizer. Each vertex “emit” instruction may use an “emit stream” designation that may indicate to which stream the vertex is going.
The GS may use a “cut stream” instruction or an “emitthencut stream” instruction to complete a strip primitive type. In such examples, a next vertex will start a new primitive for a given stream. In some examples, a programmer may declare (using an API), at most, one of the streams to be used as a rasterized stream when setting up streams. In addition, four 1D buffers may be tied to one stream, but the total number of buffers tied to all of the GS streams may not exceed four. Off-chip buffers are not typically shared between streams.
When a vertex is emitted for a given stream, the subsections of the vertex for each buffer tied to the stream are written to an off-chip buffer (such as storage unit 48) as a complete primitive. That is, partial primitives are generally not written to an off-chip buffer. In some examples, the data written to the off-chip buffers may be expanded to include an indication of a primitive type, and if more than one stream is enabled for a given GS, an output primitive type for the GS may be “point” only.
The GS stage may receive a PrimitiveID parameter as an input, because the PrimitiveID is a system generated value. The GS may also output a PrimitiveID parameter, a ViewportIndex parameter, and a RenderTargetArrayIndex parameter to one or more registers. An attribute interpolation mode for the GS inputs is typically declared to be constant. In some examples, it is possible to declare the GS to be NULL, but still enable output. In such examples, only stream zero may be active. Therefore, the VS output may be expanded to list a primitive type, and may write values to buffers tied to stream zero. If the input primitive type is declared to be an adjacent primitive type, the adjacent vertex information may be dropped. That is, for example, only internal vertices of an adjacent primitive (e.g., even numbered vertex number) may be processed to form a non-adjacent primitive type.
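As an illustration of dropping adjacent vertex information when the GS is NULL, the following minimal sketch handles only the triangle-with-adjacency case. It assumes the six-vertex layout in which the even-numbered vertices form the triangle itself; the function name is hypothetical and is not part of any API.

```python
def drop_adjacency(triangle_adj_vertices):
    """Return the non-adjacent triangle from a triangle with adjacency.

    Assumes six vertices in which indices 0, 2, and 4 are the internal
    (triangle) vertices and indices 1, 3, and 5 are adjacent vertices.
    """
    if len(triangle_adj_vertices) != 6:
        raise ValueError("expected six vertices for a triangle with adjacency")
    # Keep only the even-numbered vertices; the adjacent ones are dropped.
    return triangle_adj_vertices[0::2]
```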
In the case of a patch input primitive type with a NULL GS, the patch is written out as a list of points to buffers tied to the stream. If the declared stream is also rasterized, GPU 36 may render the patch as a plurality of points, as specified by patch control points. In addition, when the GS is NULL, a viewportindex parameter and a rendertargetarrayindex parameter may be assumed to be zero.
Query counters may be implemented to determine how many VS or GS operations are being processed by GPU 36, thereby allowing hardware components to track program execution. Query counters may start and stop counting based on a stat_start event and a stat_end event. The counters may be sampled using a stat_sample event. An operational block that receives a stat_start and/or stat_end event may start or stop counting at the points where increment signals are sent.
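The event-gated counting behavior can be sketched in software as follows. This is a simplified model rather than a description of the hardware: the class name, the string event names, and the 64-bit wrap-around are assumptions made for illustration.

```python
class QueryCounter:
    """Hypothetical software model of an event-gated query counter."""

    MASK_64 = (1 << 64) - 1  # assumed counter width

    def __init__(self):
        self.value = 0
        self.counting = False

    def on_event(self, event):
        # stat_start and stat_end gate counting; stat_sample reads the value.
        if event == "stat_start":
            self.counting = True
        elif event == "stat_end":
            self.counting = False
        elif event == "stat_sample":
            return self.value
        return None

    def pulse(self, amount=1):
        # Increment pulses are ignored unless counting is active.
        if self.counting:
            self.value = (self.value + amount) & self.MASK_64
```

In this model, pulses that arrive before stat_start or after stat_end leave the sampled value unchanged.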
When a driver of GPU 36 needs to read such counters, the driver may send a stat_sample event through the command processor (CP), as shown and described with respect to
GPU 36 may store a variety of data to local GPU memory 38. For example, the following query counts may be maintained by the CP in hardware. In some examples, the following query counts may be formed as 64-bit counters, which may be incremented using 1-3 bit pulses from various operational blocks, as indicated below:
In addition to the values described above, there may be two stream out related query counts that are maintained per stream. These stream out related values may include the following values:
Typically, GPU 36 may support stream out directly from the VPC. As noted above, there may be up to four streams that are supported by a GS. Each of these streams may be bound by up to four buffers, and the buffers are not typically sharable between different streams. The size of the output to each buffer may be up to 128 dwords, which is the same as the maximum size of a vertex. However, a stride may be up to 512 dwords. The output data from a stream may be stored to multiple buffers, but the data generally may not be replicated between buffers. In an example for purposes of illustration, if “color.x” is written to one of the buffers tied to a stream, then this “color.x” may not be sent to another buffer tied to same stream.
Streaming out to the buffers may be performed as a complete primitive. That is, for example, if there is space in any buffer for a given stream for only two vertices, and a primitive type is triangle (e.g., having three vertices), then the primitive vertices may not be written to any buffer tied with that stream.
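The all-or-nothing rule for streaming out a primitive can be illustrated with a short sketch. The buffer representation (a dictionary holding a capacity in vertices and a list of written data) and the function name are assumptions chosen for clarity, not an actual driver or hardware interface.

```python
def stream_out_primitive(buffers, primitive):
    """Write a primitive to every buffer tied to a stream, or to none.

    buffers:   list of dicts with 'capacity' (in vertices) and 'data' (list).
    primitive: list of vertices; each vertex holds one subsection per buffer.
    Returns True if the complete primitive was written, False otherwise.
    """
    vertex_count = len(primitive)
    # If any buffer lacks room for the whole primitive, write nothing at all.
    if any(len(b["data"]) + vertex_count > b["capacity"] for b in buffers):
        return False
    for vertex in primitive:
        for buffer, subsection in zip(buffers, vertex):
            buffer["data"].append(subsection)
    return True
```

With one buffer that has room for only two vertices, a three-vertex triangle is rejected and none of the buffers tied to the stream is modified.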
If the GS is null, and stream out is enabled, the stream out may be identified as a default stream zero. When stream out is being performed, the position information may be written into the VPC as well as into the PC, which may consume an extra slot. In addition, when binning is performed (e.g., the process of assigning vertices to bins for tile based rendering), stream out may be performed during the binning pass.
In some APIs, such as DirectX 10, a DrawAuto function (that may patch and render previously created streams) may be specified that consumes stream out data. For example, a GPU driver may send an event for a stream out flush for a given stream along with a memory address. The VPC, upon receiving such an event, may send an acknowledge (ack) bit to the RBBM. The RBBM, upon receiving the ack bit, writes the amount of buffer space available in a buffer (buffer filled size) to a driver specified memory or memory location.
In the meantime, a pre-fetch parser (PFP), which may be included within the command processor (CP), waits to send any draw call. Once the memory address is written, the PFP may then send a next draw call. If the next draw call is an auto draw call, the GPU driver may send a memory address containing the buffer filled size as part of a packet that indicates draw calls and state changes (e.g., a so-called “PM4” packet). The PFP reads the buffer_filled_size from that memory location, and sends the draw call to the PC.
GPU 36 may initially invoke vertex shading operations, for example, upon receiving vertex shader instructions (210). Invoking the vertex shading operations may cause GPU 36 to designate one or more shading units 40 for the vertex shading operations. In addition, other components of GPU 36 (such as a vertex parameter cache, rasterizer, and the like) may be configured to receive a single output per input from each of the designated shading units 40.
GPU 36 may perform, with a hardware shading unit designated for vertex shading operations, vertex shading operations to shade input vertices (212). That is, the hardware shading unit may perform vertex shading operations to shade input vertices and output vertex shaded vertices. The hardware shading unit may receive one vertex and output one shaded vertex (e.g., a 1:1 relationship between input and output).
GPU 36 may determine whether to perform geometry shading operations (214). GPU 36 may make such a determination, for example, based on mode information. That is, GPU 36 may execute patch code to determine whether any valid geometry shader instructions are appended to the executed vertex shader instructions.
If GPU 36 does not perform geometry shading operations (the NO branch of step 214), the hardware shading unit may output one shaded vertex for each input vertex (222). If GPU 36 does perform geometry shading operations (the YES branch of step 214), the hardware shading unit may perform multiple instances of geometry shading operations to generate one or more new vertices based on the received vertices (216). For example, the hardware shading unit may perform a predetermined number of geometry shading instances, with each instance being associated with an output identifier. The hardware shading unit may maintain an output count for each instance of the geometry shading operations. In addition, an output identifier may be assigned to each output vertex.
Accordingly, to determine when to output a geometry shaded vertex, the hardware shading unit may determine when the output count matches an output identifier (218). For example, if an output count for a geometry shading operation does not match the output identifier (the NO branch of step 218), the vertex associated with that geometry shading operation is discarded. If the output count for a geometry shading operation does match the output identifier (the YES branch of step 218), the hardware shading unit may output the vertex associated with the geometry shading operation. In this way, the hardware shading unit designated for vertex shading outputs a single shaded vertex and discards any unused vertices for each instance of the geometry shading program, thereby maintaining a 1:1 input to output ratio.
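The matching of an output count against an output identifier can be modeled in software as shown below. This is a sketch under stated assumptions: the geometry shader is represented as a Python generator that yields emitted vertices, and both function names are hypothetical.

```python
def run_gs_instance(geometry_shader, primitive, output_id):
    """Run one geometry shading instance and keep at most one vertex.

    The instance executes the full geometry shader, but only the vertex
    whose emit position (output count) equals output_id is kept; all other
    emitted vertices are discarded, preserving a 1:1 interface.
    """
    kept = None
    output_count = 0
    for vertex in geometry_shader(primitive):
        if output_count == output_id:
            kept = vertex
        output_count += 1
    return kept


def emulate_geometry_shader(geometry_shader, primitive, max_output_vertices):
    """Run one instance per possible output vertex and gather the results."""
    outputs = [
        run_gs_instance(geometry_shader, primitive, output_id)
        for output_id in range(max_output_vertices)
    ]
    # Instances whose output_id exceeds the emitted count produce nothing.
    return [vertex for vertex in outputs if vertex is not None]
```

Each instance repeats the same geometry shading work, which trades redundant computation for the ability to keep the single-input, single-output interface of a unit designated for vertex shading.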
Certain stages shown in
Hull shader stage 244 receives primitives from vertex shader stage 242 and is responsible for carrying out at least two actions. First, hull shader stage 244 is typically responsible for determining a set of tessellation factors. Hull shader stage 244 may generate tessellation factors once per primitive. The tessellation factors may be used by tessellator stage 246 to determine how finely to tessellate a given primitive (e.g., split the primitive into smaller parts). Second, hull shader stage 244 is responsible for generating control points that will later be used by domain shader stage 248. That is, for example, hull shader stage 244 is responsible for generating control points that will be used by domain shader stage 248 to create actual tessellated vertices, which are eventually used in rendering.
When tessellator stage 246 receives data from hull shader stage 244, tessellator stage 246 uses one of several algorithms to determine an appropriate sampling pattern for the current primitive type. For example, in general, tessellator stage 246 converts a requested amount of tessellation (as determined by hull shader stage 244) into a group of coordinate points within a current “domain.” That is, depending on the tessellation factors from hull shader stage 244, as well as the particular configuration of the tessellator stage 246, tessellator stage 246 determines which points in a current primitive need to be sampled in order to tessellate the input primitive into smaller parts. The output of tessellator stage may be a set of domain points, which may include barycentric coordinates.
Domain shader stage 248 takes the domain points, in addition to control points produced by hull shader stage 244, and uses the domain points to create new vertices. Domain shader stage 248 can use the complete list of control points generated for the current primitive, textures, procedural algorithms, or anything else, to convert the barycentric “location” for each tessellated point into the output geometry that is passed on to the next stage in the pipeline. As noted above, certain GPUs may be unable to support all of the shader stages shown in
In addition, supporting a relatively longer graphics processing pipeline may require a relatively more complex hardware configuration. For example, control points, domain points, and tessellation factors from hull shader stage 244, tessellator stage 246, and domain shader stage 248 may require reads and writes to off-chip memory, which may consume memory bus bandwidth and may increase the amount of power consumed. In this sense, implementing a graphics pipeline with many stages using dedicated shading units 40 for each shader stage may result in less power efficient GPUs. In addition, such GPUs may also be slower in terms of outputting rendered images due to delay in retrieving data from off-chip memory as a result of limited memory bus bandwidth.
According to aspects of this disclosure, as described in greater detail below, shading units 40 designated by GPU 36 to perform a particular shading operation may perform more than one operation. For example, a shading unit 40 designated to perform vertex shading (VS) operations may also perform hull shading operations associated with hull shader stage 244. In another example, the same shading unit 40 may also perform domain shading operations associated with domain shader stage 248, followed by geometry shader operations associated with geometry shader stage 250.
As described in greater detail below, GPU 36 may perform the shading operations above by breaking a draw call into two sub-draw calls (e.g., pass I and pass II), with each sub-draw call having associated merged shader stages. That is, GPU 36 may invoke the shading unit 40 to perform vertex shading operations, but may also implement the shading unit 40 to perform hull shading operations during a first pass. The GPU 36 may then use the same shading unit 40 (designated to perform vertex shading operations) to perform domain shading operations and geometry shading operations without ever re-designating the shading unit 40 to perform the hull shading, domain shading, or geometry shading tasks.
Hull shader stage 244 also generates tessellation factors that may be used to control the amount of tessellation of a patch. For example, hull shader stage 244 may determine how much to tessellate based on a viewpoint and/or view distance of the patch. If an object is relatively close to the viewer in a scene, a relatively high amount of tessellation may be required to produce a generally smooth looking patch. If an object is relatively far away, less tessellation may be required.
Tessellator stage 246 receives tessellation factors and performs tessellation. For example, tessellator stage 246 operates on a given patch (e.g., a Bezier patch) having a uniform grade to generate a number of {U,V} coordinates. The {U, V} coordinates may provide texture for the patch. Accordingly, domain shader stage 248 may receive the control points (having displacement information) and the {U,V} coordinates (having texture information) and output tessellated vertices. These tessellated vertices may then be geometry shaded, as described above.
According to aspects of this disclosure, and as described in greater detail below, shading operations associated with hull shader stage 244 and domain shader stage 248 may be performed by the same shading units of a GPU (such as shading units 40). That is, for example, one or more shading units 40 may be designated to perform vertex shading operations. In addition to the vertex shading operations, the GPU may append shader instructions associated with hull shader stage 244 and domain shader stage 248 such that the shaders are executed by the same shading units in sequence and without being reconfigured to perform the tessellation operations.
In the example shown in
In some examples, tessellator stage 264 may include fixed function hardware units for performing tessellation. Tessellator stage 264 may receive tessellation factors and control points from hull shader stage 262 and output so-called domain points (e.g., {U,V} points) that specify where to tessellate. Domain shader stage 266 uses these domain points to compute vertices using output patch data from hull shader stage 262. Possible output primitives from domain shader stage 266 include, for example, a point, a line, or a triangle, which may be sent for rasterization, stream out 270, or to geometry shader stage 268. If any of the tessellation factors are less than or equal to zero, or not a number (NaN), the patch may be culled (discarded without being computed further).
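The patch culling condition can be expressed as a short check. The function name and the representation of tessellation factors as a list of floating point values are assumptions for illustration only.

```python
import math


def should_cull_patch(tess_factors):
    """Return True if any tessellation factor is <= 0 or not a number."""
    return any(math.isnan(f) or f <= 0.0 for f in tess_factors)
```

The NaN test is performed explicitly because a comparison such as `f <= 0.0` evaluates to false for NaN.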
The shader stages shown in
According to aspects of this disclosure, more than one of the shader stages in
For example, GPU 36 may execute an input draw call that includes tessellation operations, as described above with respect to
In an example for purposes of illustration, assume a draw call includes 1000 associated patches for rendering. In addition, assume that local memory has the capacity to store data associated with 100 patches. In this example, GPU 36 (or a driver for GPU 36, such as GPU driver 50) may split the draw call into 10 sub-draw calls. GPU 36 then performs the Pass I operations and Pass II operations for each of the 10 sub-draw calls in sequence.
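The arithmetic of this example can be sketched as follows. The function name and the representation of a sub-draw call as a (first patch, patch count) pair are hypothetical; an actual driver may partition the work differently.

```python
def split_draw_call(num_patches, patches_per_sub_draw):
    """Split a draw call into (first_patch, patch_count) sub-draw calls.

    patches_per_sub_draw is the number of patches whose Pass I output is
    assumed to fit in local memory at one time.
    """
    if patches_per_sub_draw <= 0:
        raise ValueError("patches_per_sub_draw must be positive")
    return [
        (first, min(patches_per_sub_draw, num_patches - first))
        for first in range(0, num_patches, patches_per_sub_draw)
    ]
```

For 1000 patches and a capacity of 100 patches, the function returns 10 sub-draw calls of 100 patches each. When the patch count is not a multiple of the capacity, the final sub-draw call is smaller.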
With respect to Pass I operations, upon vertex shading operations being invoked by GPU 36, VS/HS stage 280 may perform both vertex shading operations and hull shading operations. That is, merged VS/HS stage 280 may include a single set of one or more shading units and may perform the operations described above with respect to vertex shader stage 260 and hull shader stage 262 in sequence. As described in greater detail below, aspects of this disclosure allow GPU 36 to perform hull shading operations with the same shading unit as the vertex shading operations, while still adhering to the appropriate interface. In some examples, hull shader instructions may be appended to vertex shader instructions using a patch code, thereby allowing the same shading unit to execute both sets of instructions.
GPU 36 may then perform Pass II operations. For example, tessellation stage 282 may perform tessellation, as described with respect to tessellation stage 264 above. Merged DS/GS stage 284 may include the same set of one or more shading units 40 as the merged VS/HS stage 280 described above. Merged DS/GS stage 284 may perform the domain shading and geometry shading operations described above with respect to domain shader stage 266 and geometry shader stage 268 in sequence. In some examples, geometry shader instructions may be appended to domain shader instructions using a patch code, thereby allowing the same shading unit to execute both sets of instructions. Moreover, these domain shader instructions and geometry shader instructions may be appended to the hull shader instructions (of Pass I), so that the same shading unit may perform vertex shading, hull shading, domain shading, and geometry shading without being re-configured.
The Pass II geometry shading operations may include essentially the same geometry shading operations as those described above. However, when beginning Pass II operations, the GPR initialized input (previously for the VS stage, now for the DS stage) may include (u, v, patch_id) produced by tessellation stage 282, rather than fetched data from the vertex fetch decoder (VFD). The PC may also compute rel_patch_id for Pass II, and may pass the patch ID information to the DS along with (u,v) computed by tessellation stage 282. Tessellation stage 282 may use tessellation factors to produce (u,v) coordinates for tessellated vertices. The output of tessellation stage 282 can be fed to merged DS/GS stage 284 to prepare tessellated vertices for further amplification (geometry shading) or stream out 286. The DS uses hull shader (HS) output control point data and HS patch constant data from the off-chip scratch memory.
In some examples, the two passes shown in
The command processor (CP) may then send a draw call for Pass II. In an example, the ratio of the amount of latency to start a first useful vertex versus the amount of work done in Pass II may be approximately less than 2%. Accordingly, in some examples, there may be no overlap between Pass I and Pass II. In other examples, as described below, the GPU may include an overlap between Pass I and Pass II operations. That is, the GPU may overlap the pixel shading operations of pixel shader stage 288 of Pass II of a previous draw call with vertex shading operations of VS/HS stage 280 of the Pass I of a current draw call, because pixel shader processing may take longer than vertex shader processing.
According to aspects of this disclosure, a primitive controller (PC) may send a PASS_done event after Pass I, which may help the hardware unit to switch to Pass II. In an example in which there may be overlap between Pass I and Pass II, the existence of Pass I operations and Pass II operations may be mutually exclusive at the shader processor executing the instructions. However, the tessellation factors for Pass II may be fetched while Pass I is still executing.
As described below with respect to
In the example of
After executing the vertex shading operations, GPU 36 may store the shaded vertices to local memory resources. For example, GPU 36 may export the vertex shader output to a position cache (e.g., of GPU memory 38). The vertex shading operations and hull shading operations may be separated by a VS END instruction. Accordingly, after executing the VS END instruction and completing the vertex shading operations, one or more shading units 40 designated to perform the vertex shading operations each begin performing hull shading operations.
The same shading unit 40 may then perform hull shading operations to generate an output patch having control points V0-V3. In this example, the shading unit 40 executes multiple instances of the hull shader operation (which are denoted by their output identifiers (Outvert) in a similar manner to the geometry shader operations described above with respect to
That is, the four columns of the table shown in
In the example of
According to aspects of this disclosure, the same shading unit 40 designated to perform vertex shading operations also performs the hull shading operations described above. Moreover, the same shading unit 40 may also perform domain shading and geometry shading operations during a second pass (Pass II) of the draw call. For example, GPU 36 may change state to shader specific resources (e.g., hull, domain, and/or geometry shader constants, texture offsets, and the like). GPU 36 may perform this state change according to a mode (draw mode) assigned to the shading operations.
Table 4, shown below, illustrates operational modes and parameters that may be maintained by GPU 36 to perform vertex shading, hull shading, domain shading, and geometry shading with the same shading unit 40.
In some instances, as indicated in Table 4 above, certain shading operations may not be performed for a particular draw call. For example, a draw call may include vertex shading, hull shading, domain shading, and pixel shading operations, but may not include geometry shading operations (as shown for Mode 3). GPU 36 may use mode information to determine which shading operations to perform when executing a draw call.
Table 5, shown below, illustrates parameter values when performing Pass II operations without performing geometry shading operations.
Table 6, shown below, illustrates parameter values when performing Pass II operations including performing geometry shading operations.
After completing the operations associated with the first pass (Pass I) as shown in
For example,
As shown in
The hardware shading unit may then perform vertex shading operations to generate one or more shaded vertices. The hardware shading unit may write the shaded vertices to local memory, so that the shaded vertices are available for hull shading operations.
The GPU may then switch the memory offsets and program counter prior to performing the hull shading operations. The GPU may perform such tasks, for example, when executing the patch code described above. The hardware shading unit may then read the shaded vertices from local memory and perform hull shading operations to generate one or more control points and tessellation factors.
The control points and tessellation factors generated during the first pass may be stored, for example, to local GPU memory. In some examples, the control points and tessellation factors may be stored in separate buffers within local GPU memory.
In some instances, in the VS portion of the shading operations, only valid VS fibers are executed (as noted above with respect to
For example,
According to aspects of this disclosure, the first pass (described with respect to
In any case, as shown in
The hardware shading unit may then perform domain shading operations to generate one or more tessellated vertices. The hardware shading unit may write the tessellated vertices to local memory, so that the tessellated vertices are available for geometry shading operations.
The GPU may then switch the memory offsets and program counter prior to performing the geometry shading operations. The GPU may perform such tasks, for example, when executing the patch code described above. The hardware shading unit may then read the tessellated vertices from local memory and perform geometry shading operations to generate one or more geometry shaded vertices, which may be stored to a vertex parameter cache.
In the example shown in
As shown in the examples of
According to aspects of this disclosure, each shader stage (VS/GS/HS/DS) may be compiled separately and without knowing how the stages will be linked during execution. Accordingly, three GPRs may be reserved to store parameters such as primitiveID, rel_patch_ID and misc. The compiler may cause input attributes or internal variables to be stored in GPR IDs beyond two for DX10/DX11 applications.
In the example of
With respect to a dispatch mechanism for DirectX 11, a draw call may be divided into a two-pass draw by CP 344. Based on available storage to store the output of Pass I, a draw call may be divided into multiple sub-draw calls, with each sub-draw call having a Pass I and a Pass II. Each sub-draw call may adhere to the ordering of passes, such that Pass I is performed for a sub-draw call, followed by Pass II for the sub-draw call.
Upon receiving a sub-draw call with Pass I, PC 336 may fetch indices and process a patch primitive type using VS/HS 332. VS/HS 332 creates HS_FIBERS_PER_PATCH = 2^ceil(log2(max(inputpatch, outputpatch))) VS fibers per patch (that is, the larger of the input and output patch sizes rounded up to a power of two) and fits an integer number of patches per wave (where a wave is a given amount of work). There is no vertex reuse at the input. Since the output of the VS/HS 332 is transferred off-chip to system scratch 356, there may be no allocation of position and parameter cache.
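The fiber count can be worked through in a short sketch. The sketch reads the expression for HS_FIBERS_PER_PATCH as two raised to the power ceil(log2(max(inputpatch, outputpatch))), which is an interpretation of the text. The function names and the wave size used in the usage note are assumptions, not values taken from this disclosure.

```python
def hs_fibers_per_patch(input_patch_size, output_patch_size):
    """Round the larger patch size up to the next power of two."""
    n = max(input_patch_size, output_patch_size)
    if n < 1:
        raise ValueError("patch sizes must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for integers n >= 1.
    return 1 << (n - 1).bit_length()


def patches_per_wave(wave_size, fibers_per_patch):
    """Fit a whole number of patches into a wave of wave_size fibers."""
    return wave_size // fibers_per_patch
```

For example, a patch with three input control points and four output control points uses four fibers, while a patch with five input control points uses eight. Under an assumed wave size of 64 fibers, eight such eight-fiber patches fit in one wave.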
Based on HS_FIBERS_PER_PATCH a GPU driver (such as GPU driver 50 shown in
The driver may also add additional size to HS_LM_SIZE if the driver is to write intermediate data to local memory before writing the final data to memory 348. Such additional space may be useful if HS is using a computed control point in multiple phases of the HS (e.g., in a constant phase of the HS). A high level sequencer (HLSQ) that receives the draw call of this type may check which shading unit's local memory (LM) has enough storage for GS_LM_SIZE. The HLSQ may maintain the start base address of such an allocation, as well as the address of any read or write to local memory by an allocated wave. The HLSQ may also add a computed offset within the allocated memory to the base address when writing to local memory.
System interpreted values (SIV) (e.g., clip/cull distances, rendertarget, viewport) may also be provided to VPC 334 for loading into PS 346. A shader stage (e.g., VS or GS) may conditionally output the values. Accordingly, if PS 346 needs the values, PS 346 may set such a condition as part of a state. If PS 346 does not need the values, and such a determination is done after compilation of the pixel shading operations, the state of outputting these SIVs can be reset so that VS or GS will not write the values to VPC 334 at draw time.
For null GS (if no geometry shader stage is being executed), the compiler may also create a template GS, so that there is no separate path for null or non-null GS. This template GS may copy VS or domain shader (DS) output to local memory and further copy from local memory to output to VPC 334. This may only be done for a case in which stream out is performed.
The process of binning and consuming a visibility stream may be different, depending on which shaders are being implemented. For example, certain GPUs may divide image data to be rendered into tiles or “bins,” rendering each bin successively (or sometimes concurrently or in parallel) until the entire image is rendered. By dividing the image into bins, the GPUs may reduce on-chip memory requirements while also promoting less data retrieval from off-chip memory (considering that the on-chip memory may be large enough to store sufficient image data to render the tile).
With respect to a visibility stream, a Z-buffer algorithm may be used to determine primitives that are occluded by other primitives (and therefore do not need to be rendered). For example, the GPU may draw each primitive, working from the back-most (depth-wise) primitive to the front-most (again, depth-wise) primitive. In this example, some primitives may be rendered only to be drawn over by other primitives.
As a result of this so-called “overdraw,” GPUs may be adapted to perform early Z-buffer algorithm testing, which allows the GPUs to identify primitives that are entirely occluded or not within the eye view to be ignored or bypassed when the GPU performs rendering. In this respect, GPUs may be adapted to determine what may be referred to as visibility information with respect to each primitive and/or object.
With respect to DX10, during the binning pass, PC 336 sends “end of primitive” to GRAS 340 at the end of all the output primitives from a GS. Therefore, visibility information is recorded per input primitive. Stream out may be performed during the binning pass. CP 344 can read all stream out buffer related information at the end of the binning pass. Geometry related query counters may be updated during the binning pass.
A visibility pass may read the visibility stream and advance the stream as visibility information per primitive is read. If no stream is rasterized, then the visibility pass may be skipped. Otherwise, PC 336 checks the visibility of each input GS primitive and proceeds to render without any stream outs.
With respect to DX11, during a binning pass, PC 336 sends “end of primitive” to GRAS 340 at the end of all the output primitives from a GS in Pass II (e.g., one bit per input patch). Stream out may be performed as described above. During a visibility pass, a visibility stream is processed in Pass I along with patches (only patches with visibility may be processed). Pass II only processes visible patches and fetches tessellation factors for visible patches only.
Table 7, shown below, provides information regarding the binning pass and rendering pass for each of five different modes of operation. Each mode corresponds to certain operations being performed by a single hardware shading unit, as described above.
In the example of
If the draw call does include tessellation operations, GPU 36 may determine the size of local GPU memory resources, such as GPU memory 38 (384). GPU 36 may then split the draw call into a plurality of sub-draw calls (386). In some examples, each sub-draw call may include the Pass I operations and Pass II operations described above. For example, Pass I operations may include vertex shading operations and hull shading operations, while Pass II operations may include domain shading operations and geometry shading operations.
The amount of data rendered by each sub-draw call may be determined based on the size of GPU memory 38. For example, GPU 36 may configure the sub-draw calls so that GPU 36 is able to store all of the data generated by the Pass I operations to local memory for use with Pass II operations. In this way, GPU 36 may reduce the amount of data being transferred between local GPU memory and memory external to the GPU, which may reduce latency associated with rendering, as described above.
After determining the sub-draw calls, GPU 36 may perform Pass I operations for the first sub-draw call (388). As noted above, Pass I operations may include performing vertex shading operations and hull shading operations using the same hardware shading unit, e.g., each of one or more shading units 40. That is, while GPU 36 may designate a number of shading units 40 to perform vertex shading, each of the shading units 40 may perform both vertex shading and hull shading operations.
GPU 36 may also perform Pass II operations for the first sub-draw call (390). As noted above, Pass II operations may include performing domain shading operations and geometry shading operations using the same one or more shading units 40. Again, while GPU 36 may designate a number of shading units 40 to perform vertex shading, each of the shading units 40 may perform Pass II operations such that each of shading units 40 performs vertex shading operations, hull shading operations, domain shading operations, and geometry shading operations.
GPU 36 may also perform pixel shading operations for the sub-draw call (392). GPU 36 may perform pixel shading operations using one or more other shading units 40. In other examples, GPU 36 may perform pixel shading for an entire draw call after all of the sub-draw calls are complete.
GPU 36 may then determine whether the completed sub-draw call is the final sub-draw call of the draw call (392). If the sub-draw call is the final sub-draw call of the draw call, GPU 36 may output the rendered graphics data associated with the draw call. If the sub-draw call is not the final sub-draw call of the draw call, GPU 36 may return to step 388 and perform Pass I operations for the next sub-draw call.
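The per-sub-draw-call flow described above may be sketched, by way of example only, as a simple loop. The callables `pass1`, `pass2`, and `pixel_shade` are hypothetical placeholders for GPU work and are not part of the disclosure; the sketch shows only the ordering of the steps and the return to Pass I for each subsequent sub-draw call.

```python
def render_draw_call(sub_draw_calls, pass1, pass2, pixel_shade):
    """Run Pass I, Pass II, and pixel shading for each sub-draw call in turn.

    pass1, pass2, and pixel_shade are hypothetical callables; pass1 returns
    the data that would be kept in local memory for use by pass2.
    """
    outputs = []
    for sub_draw in sub_draw_calls:
        local_data = pass1(sub_draw)              # vertex and hull shading
        primitives = pass2(local_data)            # domain and geometry shading
        outputs.append(pixel_shade(primitives))   # pixel shading
    # The loop ends after the final sub-draw call; the collected results
    # stand in for the rendered graphics data of the whole draw call.
    return outputs
```

In the alternative noted above, in which pixel shading is performed for the entire draw call at once, the `pixel_shade` call would instead be moved after the loop.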
It should be understood that the steps shown in
In the example of
In this sense, each of the shading units 40 changes operational modes to perform hull shading operations. However, the mode change does not include re-designating the shading units 40 to perform the hull shading operations. That is, components of GPU 36 may still be configured to send data to and receive data from the shading units 40 in the 1:1 interface format of a shading unit designated for vertex shading operations.
GPU 36 may then perform hull shading operations associated with a hull shader stage of a graphics rendering pipeline using the same shading units 40 that performed the vertex shading operations, as described above (404). For example, each shading unit 40 may operate on shaded vertices to generate one or more control points, which may be used for tessellation.
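By way of example only, the following sketch models a mode change that leaves the designation of a shading unit untouched. The `ShadingUnit` fields, the `STAGE_LAYOUT` addresses and offsets, and the `switch_mode` helper are all hypothetical; they merely illustrate changing a program counter and a resource pointer while the rest of the GPU continues to treat the unit as designated for vertex shading.

```python
from dataclasses import dataclass


@dataclass
class ShadingUnit:
    designation: str = "vertex"   # how other GPU components address the unit
    mode: str = "vertex"          # stage whose instructions are executing
    program_counter: int = 0      # where the unit fetches instructions from
    resource_offset: int = 0      # base offset of the active stage's resources


# Hypothetical instruction addresses and resource offsets for each stage.
STAGE_LAYOUT = {
    "vertex": (0x000, 0x0000),
    "hull": (0x100, 0x1000),
}


def switch_mode(unit, stage):
    """Point the unit at another stage's code and resources.

    The designation field is deliberately left unchanged, so the unit keeps
    the interface of a unit designated for vertex shading.
    """
    unit.program_counter, unit.resource_offset = STAGE_LAYOUT[stage]
    unit.mode = stage
    return unit
```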
It should be understood that the steps shown in
In the example of
After performing the domain shading operations, each of the designated shading units 40 may store the domain shaded vertices to local memory for geometry shading operations (402). GPU 36 may also change a program counter for tracking hull shading operations, as well as change one or more resource pointers to a hull shader resources offset. In examples in which the operations of
In this sense, each of the shading units 40 changes operational modes to perform domain shading and geometry shading operations. However, the mode change does not include re-designating the shading units 40 to perform the domain shading and geometry shading operations. That is, components of GPU 36 may still be configured to send data to and receive data from the shading units 40 in the 1:1 interface format of a hardware shading unit designated for vertex shading operations.
GPU 36 may then perform geometry shading operations associated with a geometry shader stage of a graphics rendering pipeline using the same shading units 40 that performed the domain shading operations, as described above (424). For example, each shading unit 40 may operate on domain shaded vertices to generate one or more geometry shaded vertices.
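The hand-off through local memory between domain shading and geometry shading may be illustrated, in a non-limiting manner, as follows. The function `pass2_for_primitive` and its callable parameters are hypothetical, and a Python list stands in for local GPU memory; the sketch assumes, for illustration, that the geometry stage reads all staged vertices of one primitive.

```python
def pass2_for_primitive(domain_points, domain_shade, geometry_shade):
    """Domain-shade each point, stage the results locally, then geometry-shade.

    domain_shade and geometry_shade are hypothetical placeholders for shader
    programs; local_memory is a list standing in for local GPU memory.
    """
    local_memory = []
    for point in domain_points:
        # Each invocation takes one input and stores one domain shaded vertex.
        local_memory.append(domain_shade(point))
    # The geometry stage reads the staged vertices back from local memory
    # and may return one or more geometry shaded vertices.
    return geometry_shade(local_memory)
```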
It should be understood that the steps shown in
In the example of
Upon completing the operations associated with the first shader stage, GPU 36 may switch operational modes, allowing the same shading units 40 to perform a variety of other shading operations (442). For example, as described above, GPU 36 may change a program counter and one or more resource pointers for performing second shading operations.
In some examples, GPU 36 may switch the operational mode of the shading units 40 based on mode information associated with the draw call being executed. For example, a driver of GPU 36 (such as GPU driver 50) may generate a mode number for a draw call that indicates which shader stages are to be executed in the draw call. GPU 36 may use this mode number to change operational modes of the shading units upon executing a patch code, as described above.
Table 8, shown below, generally illustrates mode information including mode numbers for a variety of combinations of shader stages.
As shown in Table 8, each mode dictates which shader stages are performed by shading units. Accordingly, GPU 36 can string shader instructions together, allowing the same shading units 40 to perform multiple shading operations. That is, GPU 36 can patch together the appropriate shader instructions based on the mode number of the draw call being executed.
In this way, GPU 36 may then perform second shading operations with the same shading units 40 designated to perform the first shading operations (444). For example, GPU 36 may perform a combination of vertex shading operations, hull shading operations, domain shading operations, and geometry shading operations, as shown in Table 8 above.
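As a non-limiting sketch of how a mode number might select and string together shader stages, consider the following. The bit-flag encoding shown here is hypothetical and is not the encoding of Table 8; the names `stages_for_mode` and `patch_program` and the representation of instructions as lists of strings are likewise assumptions for illustration.

```python
# Hypothetical bit flags; these values are NOT taken from Table 8.
VS, HS, DS, GS = 1, 2, 4, 8
STAGE_ORDER = [(VS, "vertex"), (HS, "hull"), (DS, "domain"), (GS, "geometry")]


def stages_for_mode(mode_number):
    """Return the shader stages, in pipeline order, enabled by a mode number."""
    return [name for flag, name in STAGE_ORDER if mode_number & flag]


def patch_program(mode_number, stage_code):
    """Concatenate per-stage instruction lists into one program for a unit.

    stage_code maps a stage name to a list of instructions (hypothetical).
    """
    program = []
    for stage in stages_for_mode(mode_number):
        program.extend(stage_code[stage])
    return program
```

For instance, under this assumed encoding a draw call with a mode number of `VS | GS` would cause a shading unit to execute vertex shading instructions followed by geometry shading instructions, skipping the tessellation-related stages.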
It should be understood that the steps shown in
While certain examples described above include initially designating hardware shading units to perform vertex shading operations and transitioning to performing other shading operations with the same hardware shading units, it should be understood that the techniques of this disclosure are not limited in this way. For example, a GPU may initially designate a set of hardware shading units to perform a variety of other shading operations. That is, in a system that allows a GPU to designate hardware shading units to perform three different shading operations, the GPU may designate hardware shading units to perform vertex shading operations, hull shading operations, and pixel shading operations. In this example, the GPU may initially designate one or more hardware shading units to perform hull shading operations, but may also perform domain shading operations and geometry shading operations with the same hardware shading units, as described above. A variety of other operational combinations are also possible.
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on an article of manufacture comprising a non-transitory computer-readable medium. Computer-readable media may include computer data storage media. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The code may be executed by one or more processors, such as one or more DSPs, general purpose microprocessors, ASICs, FPGAs, or other equivalent integrated or discrete logic circuitry. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Various examples have been described. These and other examples are within the scope of the following claims.
This application is a continuation of U.S. patent application Ser. No. 16/711,098 filed 11 Dec. 2019, which is a continuation of U.S. patent application Ser. No. 13/830,075 filed 14 Mar. 2013, which claims the benefit of U.S. Provisional Application 61/620,340 filed 4 Apr. 2012, U.S. Provisional Application 61/620,358 filed 4 Apr. 2012, and U.S. Provisional Application 61/620,333 filed 4 Apr. 2012, the entire contents of all of which are incorporated herein by reference.
Number | Date | Country |
---|---|---|
20220068015 A1 | Mar 2022 | US |
Number | Date | Country |
---|---|---|
61620358 | Apr 2012 | US |
61620340 | Apr 2012 | US |
61620333 | Apr 2012 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 16711098 | Dec 2019 | US |
Child | 17522178 | | US |
Parent | 13830075 | Mar 2013 | US |
Child | 16711098 | | US |