Many algorithms on a graphics processing unit (GPU) may benefit from doing a query in a hierarchical tree structure (including quad-trees, oct-trees, kd-trees, R-trees, and so forth). However, these trees can be very deep, whereby traversing the tree uses many fetches from memory to reach the “leaf” node where the data typically resides. For example, to traverse a tree in the pixel shader, the traversal typically starts at the root of the tree, with many memory fetches needed to work down to a leaf to retrieve the actual data for the query point.
Consider a scenario in which lightmap data is stored in a sparse quad-tree instead of in a traditional texture map; in such a tree, the empty areas of the texture take up no memory, and areas of low variance can use lower resolutions. An algorithm may query the lightmap data in the pixel shader, replacing a traditional texture map lookup. The pixel shader queries the 2D quad-tree using a UV coordinate as with normal 2D textures. If a hypothetical quad-tree covers the equivalent of a typical high definition screen resolution and the total depth of the tree is twelve, then querying every pixel requires 1920×1080 pixels multiplied by 12 fetches per pixel, or 24,883,200 fetches. This is a significant computational expense, and any reduction in the number of fetches needed to locate the needed data is highly desirable.
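By way of illustration only, the following sketch (written as ordinary C++ host code rather than actual shader code, with an assumed node layout) shows such a naive per-pixel lookup, in which every pixel walks the quad-tree from the root and pays one dependent memory fetch per level:

    #include <array>
    #include <cstdint>
    #include <vector>

    struct QuadTreeNode {
        std::array<int32_t, 4> child;  // indices of the four children; -1 where the sparse tree stores nothing
        float value;                   // payload (e.g., lightmap data) retrieved at the end of the walk
        bool isLeaf;
    };

    // Walks from the root to the node covering (u, v) in [0,1)x[0,1); each loop
    // iteration corresponds to one dependent memory fetch, so a twelve-level
    // tree costs up to twelve fetches for every pixel shaded.
    float NaiveLookup(const std::vector<QuadTreeNode>& nodes, float u, float v) {
        int32_t current = 0;  // root index
        while (!nodes[current].isLeaf) {
            int quadrant = (u < 0.5f ? 0 : 1) + (v < 0.5f ? 0 : 2);
            // Re-map (u, v) into the chosen child's local coordinate frame.
            u = (u < 0.5f) ? u * 2.0f : (u - 0.5f) * 2.0f;
            v = (v < 0.5f) ? v * 2.0f : (v - 0.5f) * 2.0f;
            int32_t next = nodes[current].child[quadrant];
            if (next < 0) break;  // empty (sparse) region: nothing stored below
            current = next;
        }
        return nodes[current].value;
    }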
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which a hierarchical tree of nodes is traversed by stages of a graphics pipeline, including a first stage having a first traversal mechanism configured to traverse at least some of a hierarchical tree structure of data in a coarse traversal until a first state is detected. A second traversal mechanism traverses at least some of the hierarchical tree structure in a fine traversal (relative to the coarse traversal) until a second state is detected, in which the fine traversal uses state information from the coarse traversal to avoid traversing at least some nodes traversed in the coarse traversal. One or more additional stages corresponding to additional coarseness and/or fineness may be present in the pipeline and traverse the tree in coarse-to-fine traversal operations.
In one aspect, a higher stage in a graphics pipeline traverses a hierarchical tree structure in a higher stage traversal until a stopping criterion is met. State information corresponding to the higher stage traversal is provided to a lower stage in the graphics pipeline. The lower stage, upon receiving the state information, uses the state information to determine one or more starting points for a lower stage traversal, which comprises traversing the hierarchical tree of nodes based upon the one or more starting points until another stopping criterion is met.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards performing parts of tree traversal at different stages in the graphics pipeline, in a coarse-to-fine fashion. In one aspect, each coarser stage in the GPU pipeline does as much of the tree-traversal as possible (limited by the “coarseness” of that pipeline stage), and passes on the intermediate state of the traversal to the next (finer-grained) stage, which refines it further. As will be understood, the technology exploits the hierarchical coarse-to-fine nature of the graphics pipeline, by linking it to the hierarchical coarse-to-fine nature of the trees.
In general, the technology herein leverages pixel-to-pixel coherency, in which two adjacent pixels represented in a tree structure are likely to share a large portion of the path from the root, to avoid repeating that shared work for each pixel. For example, moving some of the traversal work to the geometry shader stage of the graphics pipeline (which runs per-triangle, instead of per-pixel), rather than having the pixel shader stage perform the entire traversal, results in considerable work being shared across pixels.
It should be understood that any of the examples herein are non-limiting. For one, while examples are described in the context of hierarchical object rendering on a GPU, a similar strategy may be employed for more general tasks, for three-dimensional or two-dimensional objects, and/or for various types of hierarchical tree structures. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and tree traversal in general.
The illustrated stages 108, 110 and 112 are programmed to each contain a traversal mechanism, represented by blocks 114, 116 and 118, respectively. The stages can communicate data to one another, such that a higher level traversal mechanism is able to pass intermediate state results (IR1 or IR2) to the next lower traversal mechanism.
As generally described herein, in an implementation with patch-processing capabilities, some of the traversal may be performed on the patch by the patch traversal mechanism 114, with the resulting intermediate state results IR1 (e.g., comprising a starting node or nodes for that next level) passed down to the geometry traversal mechanism 116.
In general, the patch traversal mechanism traverses the tree downwards until a node is reached whose children no longer individually contain the entire patch data; stated another way, traversal stops upon detecting that the patch intersects more than one child node. The intermediate state may correspond to one or more nodes, with a predetermined threshold set for when to stop traversing, e.g., stop once there are more than N nodes (typically a relatively small number) that intersect, and send state information comprising the parent node or all of the child nodes (if N was set to one), or the appropriate subset of child nodes that intersect (if N is greater than one). The next stage in the hierarchy loops through paths starting with the node or nodes passed, thereby ignoring nodes that were not intersected and traversing only those nodes that intersect. Note that in less structured trees where child nodes can overlap (e.g., bounding-box trees), it is common for a region to touch more than one child node, whereby sending multiple nodes may be beneficial.
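By way of a non-limiting illustration, the following sketch (ordinary C++ rather than shader code, with an assumed bounding-box node representation and a threshold of N equal to one) shows one possible form of such a coarse per-patch traversal, stopping when more than one child node intersects the patch footprint and returning the intersecting node or nodes as the intermediate state:

    #include <vector>

    struct Box { float minU, minV, maxU, maxV; };

    struct Node {
        Box bounds;
        std::vector<int> children;   // empty for leaf nodes
    };

    static bool Intersects(const Box& a, const Box& b) {
        return a.minU < b.maxU && b.minU < a.maxU &&
               a.minV < b.maxV && b.minV < a.maxV;
    }

    // Descends from 'root' while exactly one child intersects the patch
    // footprint.  When several children intersect (here the threshold N is
    // one), traversal stops and the intersecting children are returned as the
    // intermediate state (e.g., IR1) handed to the next, finer pipeline stage;
    // a larger N would instead keep descending a small frontier of nodes.
    std::vector<int> CoarsePatchTraverse(const std::vector<Node>& tree,
                                         const Box& patchBounds, int root) {
        int current = root;
        for (;;) {
            const Node& node = tree[current];
            if (node.children.empty())
                return {current};                     // reached a leaf
            std::vector<int> hits;
            for (int c : node.children)
                if (Intersects(tree[c].bounds, patchBounds))
                    hits.push_back(c);
            if (hits.size() == 1) { current = hits[0]; continue; }  // keep descending
            if (hits.empty())
                return {current};                     // sparse region below this node
            return hits;  // more than N children intersect: pass them (or the parent) down
        }
    }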
As can be readily appreciated, depending on the exact type of tree used, it also may be beneficial to have some of the nodes, particularly near the root, be “loose,” such that the children overlap slightly. This avoids having to stop traversal early because a primitive slightly overlaps child nodes.
When the intermediate state is received in the example of
As can be readily appreciated, the technology may be used at any relevant hierarchical levels. Further, coarse-to-fine traversal also may be extended above the patch processing stage. For example, if dealing with a tree whose root node covers more than just one object, some tree traversal on the CPU may be performed.
Turning to an example, consider processing a sparse quad-tree as generally represented in
Instead of starting traversal at the pixel shader stage, described herein is first performing a tree traversal at a higher stage, e.g., on the “patch” in
This is represented in
Note that in the example of
As described herein, the patch traversal mechanism sends the intermediate state (node N4's information) down for processing each tessellation primitive at the geometry shader stage. Note that the primitives in
In
In the example twelve-level tree described above (about 8×8 pixels per triangle, about 16×16 triangles per patch), the pixel shader stage covers the last three levels (8×8 pixels) and the geometry shader stage covers the four levels above those (16×16 triangles), so the per-patch traversal needs 12−4−3=5 fetches. Thus, the total number of fetches in coarse-to-fine processing may be five for the patch, plus four for each of the 256 (16×16) triangles in the patch (1,024 fetches), plus three for each of the 64 (8×8) pixels in each of those triangles (49,152 fetches), totaling 50,181 fetches for the whole patch, compared with 16,384 pixels × 12 = 196,608 fetches if every pixel traversed from the root.
At steps 502 and 504, the patch traversal mechanism starts at the root node and traverses the nodes of the tree to find a lowest child node (or subset of child nodes) that contains the data of an entire patch under that node. When this node or subset of nodes is reached, at step 506 corresponding state information is passed (e.g., an identity of the node or nodes) to the geometry shader stage. Note that in the above exaggerated example of
With respect to such “loose” nodes, many types of tree structures may be tweaked with some amount of “looseness” to avoid having to stop traversal early at the coarse stages of the pipeline. For example, if a triangle barely intersects both children while traversing the tree in the geometry shader, then instead of ending the traversal there and performing the remaining traversal in the pixel shader, a tree with some looseness may have a child that entirely covers the triangle, even though other children also intersect the triangle, whereby the process may continue traversing for at least one more level. Thus, instead of detecting intersection, the stopping criterion rule corresponds to whether a child node (or subset of nodes) contains the entire triangle data, regardless of whether there is some other intersection.
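The following non-limiting sketch illustrates such a containment-based stopping test with loose nodes; the box representation, helper names, and expansion factor are illustrative assumptions only:

    struct Box { float minU, minV, maxU, maxV; };

    // Enlarges a child's bounds by a small fraction of its size ("looseness").
    static Box Loosen(const Box& b, float factor) {
        float du = (b.maxU - b.minU) * factor;
        float dv = (b.maxV - b.minV) * factor;
        return {b.minU - du, b.minV - dv, b.maxU + du, b.maxV + dv};
    }

    static bool Contains(const Box& outer, const Box& inner) {
        return outer.minU <= inner.minU && inner.maxU <= outer.maxU &&
               outer.minV <= inner.minV && inner.maxV <= outer.maxV;
    }

    // Returns the index of a child whose loosened bounds fully contain the
    // primitive, or -1 if no single child does -- the point at which this
    // stage stops and passes the current node down to the finer stage.
    int FullyCoveringChild(const Box childBounds[4], const Box& primitiveBounds,
                           float looseness /* e.g., 0.1f */) {
        for (int i = 0; i < 4; ++i)
            if (Contains(Loosen(childBounds[i], looseness), primitiveBounds))
                return i;
        return -1;
    }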
Steps 508, 510 and 512 are similar to steps 502, 504 and 506 and are not separately described, except to note that the geometry shader performs them on appropriate primitives, and that the traversal stopping criterion or rule may differ from that of the higher level. During the traversal, the paths taken by the geometry shader can start at the node (or set of nodes) passed to it from the higher level, because the entire patch of primitives is known to be under that node or subset of nodes.
Step 514 represents the pixel shader stage receiving the state information passed from the higher, geometry shader stage. Step 516 performs the per-pixel traversal, starting from the passed node (or each node of a set, which may be a single node) for each path taken. Step 518 outputs the results when the pixel traversal completes.
In sum, each level determines a lowest node or subset of child nodes during traversal that contains the primitive's data for that level, and passes that state information to the next lowest level. For example, the geometry shader, which may act on individual triangles or quads, keeps traversing until the triangle or quad is no longer entirely covered by a child node. Then the last-fully-covering node or set of nodes is passed on to the pixel shader, which traverses from that node (or each node of the set) down to the leaf nodes and retrieves the requested data.
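By way of example, and not limitation, the following sketch models this coarse-to-fine hand-off as ordinary C++ functions rather than actual pipeline stages (which would pass the state between stages as per-primitive attributes); the node layout and helper names are illustrative assumptions, and a simple containment test stands in for the per-stage stopping rule:

    #include <vector>

    struct Box { float minU, minV, maxU, maxV; };

    struct Node {
        Box bounds;
        int child[4];   // -1 where the sparse tree stores nothing
        float value;    // payload retrieved at the end of the pixel-level descent
    };

    static bool Contains(const Box& outer, const Box& inner) {
        return outer.minU <= inner.minU && inner.maxU <= outer.maxU &&
               outer.minV <= inner.minV && inner.maxV <= outer.maxV;
    }

    // Descends from 'start' while some child still contains the whole region
    // and returns the last such node -- the state each stage passes to the next.
    static int DescendWhileCovered(const std::vector<Node>& tree, int start,
                                   const Box& region) {
        int current = start;
        for (;;) {
            int covering = -1;
            for (int i = 0; i < 4; ++i) {
                int c = tree[current].child[i];
                if (c >= 0 && Contains(tree[c].bounds, region)) { covering = c; break; }
            }
            if (covering < 0) return current;
            current = covering;
        }
    }

    // The stages differ only in the region used and in where the descent starts.
    int PatchStage(const std::vector<Node>& tree, const Box& patch) {
        return DescendWhileCovered(tree, /*root*/ 0, patch);
    }
    int GeometryStage(const std::vector<Node>& tree, int fromPatch, const Box& triangle) {
        return DescendWhileCovered(tree, fromPatch, triangle);   // resumes, not restarts
    }
    float PixelStage(const std::vector<Node>& tree, int fromGeometry, const Box& pixel) {
        return tree[DescendWhileCovered(tree, fromGeometry, pixel)].value;
    }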
Note that the above-described coarse-to-fine technology is only one example, and the technology is applicable to any hierarchical structure to be queried in a coherent fashion at one of the lower stages of the GPU pipeline. For example, a domain shader tree lookup may be sped up by performing some of the traversal at the patch level, in an analogous way. Trees other than a 2D quad-tree (e.g., one queried by a traditional UV coordinate) may be similarly traversed; for example, a 3D oct-tree may instead be traversed, using the world space position to query data. Other types of trees may be used to facilitate filtering (such as Random-Access trees), or to achieve any other property of interest.
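As a non-limiting illustration of the oct-tree variant, the following sketch performs the query keyed by a world space position, selecting one of eight children per level; the node layout and the assumption of an axis-aligned cubic root volume are illustrative only:

    #include <cstdint>
    #include <vector>

    struct OctNode {
        int32_t child[8];  // -1 where nothing is stored (sparse)
        float value;       // payload at (or averaged above) the leaves
    };

    float OctreeLookup(const std::vector<OctNode>& nodes,
                       float x, float y, float z,   // world-space position inside the root cube
                       float rootMin, float rootSize) {
        int32_t current = 0;
        float minX = rootMin, minY = rootMin, minZ = rootMin, size = rootSize;
        for (;;) {
            size *= 0.5f;
            int octant = (x >= minX + size ? 1 : 0)
                       | (y >= minY + size ? 2 : 0)
                       | (z >= minZ + size ? 4 : 0);
            int32_t next = nodes[current].child[octant];
            if (next < 0) return nodes[current].value;  // stop where the sparse tree ends
            if (octant & 1) minX += size;
            if (octant & 2) minY += size;
            if (octant & 4) minZ += size;
            current = next;
        }
    }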
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 610 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 610 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 610. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 630 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 631 and random access memory (RAM) 632. A basic input/output system 633 (BIOS), containing the basic routines that help to transfer information between elements within computer 610, such as during start-up, is typically stored in ROM 631. RAM 632 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 620. By way of example, and not limitation,
The computer 610 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 610 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 680. The remote computer 680 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 610, although only a memory storage device 681 has been illustrated in
When used in a LAN networking environment, the computer 610 is connected to the LAN 671 through a network interface or adapter 670. When used in a WAN networking environment, the computer 610 typically includes a modem 672 or other means for establishing communications over the WAN 673, such as the Internet. The modem 672, which may be internal or external, may be connected to the system bus 621 via the user input interface 660 or other appropriate mechanism. A wireless networking component 674 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 610, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 699 (e.g., for auxiliary display of content) may be connected via the user interface 660 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 699 may be connected to the modem 672 and/or network interface 670 to allow communication between these systems while the main processing unit 620 is in a low power state.
Alternatively, or in addition, the functionally described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.