The present invention is illustrated by way of example, and not by way of limitation, in the Figures of the accompanying drawings and in which like reference numerals refer to similar elements.
Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be recognized by one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments of the present invention.
Some portions of the detailed descriptions, which follow, are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as “processing” or “accessing” or “compressing” or “storing” or “rendering” or the like, refer to the action and processes of a computer system (e.g., computer system 100 of
It should be appreciated that the GPU 110 can be implemented as a discrete component, a discrete graphics card designed to couple to the computer system 100 via a connector (e.g., AGP slot, PCI-Express slot, etc.), a discrete integrated circuit die (e.g., mounted directly on the motherboard), or as an integrated GPU included within the integrated circuit die of a computer system chipset component (e.g., integrated within the bridge chip 105). Additionally, a local graphics memory 116 can optionally be included for the GPU 110 to provide high bandwidth graphics data storage.
Embodiments of the present invention implement a method for delayed frame buffer merging. In one embodiment, the GPU utilizes a tag value and a sub-portion of a frame buffer tile to store a coverage mask. The coverage mask corresponds to the degree of coverage of the tile (e.g., the number of samples covered). The pixels comprising the frame buffer tile can be stored in a compressed state by storing the color of a polygon and the coverage mask of the polygon into the memory location that stores the tile. Furthermore, additional polygons can be rendered into the tile by storing a subsequent coverage mask for a new polygon and a color for the new polygon into the memory location.
This enables new polygons to be rendered into the tile without having to access and write to the frame buffer. For example, polygons can be rendered into the tile using the delayed frame buffer merging process until the tile is full, at which point the tile can be merged into the frame buffer. In this manner, the delayed frame buffer merging process of the present invention can accumulate updates from arriving polygons into a tile within the limited size of the low latency memory (e.g., registers, caches) of the GPU 110, as opposed to having to read and write to the frame buffer (e.g., stored in local graphics memory 116 or in the system memory 115) and thereby incur high latency performance penalties. The delayed frame buffer merging process is described in greater detail in
The steps of the process 200 embodiment of
Process 200 begins in step 201 where GPU 110 accesses a polygon related to a group of pixels stored at a memory location. During the rendering process, the GPU 110 receives primitives, usually triangle polygons, which define the shapes, positions, and attributes of the objects comprising a 3D scene. The hardware of the GPU processes the primitives and implements the calculations required to produce realistic 3D images on the display 112. At least one portion of this process involves the rasterization and anti-aliasing of polygons into the pixels of a frame buffer, whereby the GPU 110 determines the degree to which each of the pixels of the frame buffer is affected by each of the graphics primitives comprising a scene. In one embodiment, the GPU 110 processes pixels as groups, which are often referred to as tiles. These groups, or tiles, typically comprise four pixels per tile (although tiles having 8, 12, 16, or more pixels can be implemented). In one embodiment, the GPU 110 is configured to process two adjacent tiles (e.g., comprising eight pixels).
In step 202, process 200 determines which pixels of the group are covered by the polygon. This determination as to which pixels are covered by the polygon is illustrated in
In step 203, a coverage mask is generated corresponding to the samples that are covered by the polygon 301. In one embodiment, the coverage mask can be implemented as a bit mask with one bit per sample of the group. For example, 16 bits can represent the 16 samples of the group, with each bit set according to whether the corresponding sample is covered. Where the polygon 301 only partially covers the pixels of the group, and therefore covers only some of the 16 samples, the degree of coverage can be updated into the group by storing the resulting coverage mask and the color of the polygon 301 into the memory location storing the tile.
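The one-bit-per-sample mask of step 203 can be illustrated with a minimal sketch. It assumes a four-pixel group with four samples per pixel (16 samples in total), and the names `make_coverage_mask` and `covered_count` are hypothetical helpers introduced only for illustration; they do not describe the hardware of the GPU 110.

```python
# Assumed layout: 4 pixels per group, 4 samples per pixel (4x multisampling).
SAMPLES_PER_PIXEL = 4
PIXELS_PER_GROUP = 4
NUM_SAMPLES = SAMPLES_PER_PIXEL * PIXELS_PER_GROUP  # 16 samples, 16 mask bits


def make_coverage_mask(covered_samples):
    """Return a bit mask in which bit i is set when sample i is covered."""
    mask = 0
    for i in covered_samples:
        if not 0 <= i < NUM_SAMPLES:
            raise ValueError("sample index out of range: %d" % i)
        mask |= 1 << i
    return mask


def covered_count(mask):
    """Return the number of covered samples (the degree of coverage)."""
    return bin(mask).count("1")
```

For instance, a polygon that covers samples 0, 1, 4 and 5 would yield the mask `0x0033` under this layout, and a polygon that covers every sample would yield `0xFFFF`.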
Importantly, this update can occur within memory internal to the GPU 110. This memory stores the pixel group as it is being rasterized and rendered against polygons. Thus a polygon can be rasterized and rendered into the pixel group without having to read the pixel group from the frame buffer, update the pixel group, and then write the updated pixel group back to the frame buffer (i.e., a read-modify-write operation).
In step 204, the group of pixels is updated by storing the coverage mask and the corresponding color of the polygon into the memory location for the group. This is shown in
In this manner, the delayed frame buffer merging process of the present invention can accumulate a number of updates from arriving polygons into a pixel group while delaying the necessity of merging the updates into the frame buffer.
Referring still to process 200 of
In this manner, the delayed frame buffer merging process of the present invention can accumulate a number of updates from arriving polygons into a pixel group, thereby delaying the necessity of a merge operation until the memory for the pixel group is full. Each merge operation requires a time consuming read, modify, and write to the frame buffer, so delaying merges reduces the total number of such operations that must be performed to render a given scene. As described above, the pixel group can be updated with subsequent polygons without forcing a merge into the frame buffer for each polygon.
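The accumulate-then-merge behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not the claimed implementation: the class name `PixelGroup`, the default capacity of four fragments, the list used as a stand-in frame buffer, and the rule that later fragments simply overwrite earlier ones (no depth test) are all hypothetical choices made for illustration.

```python
class PixelGroup:
    """Model of the low latency memory location holding one pixel group."""

    def __init__(self, capacity=4, num_samples=16):
        self.capacity = capacity        # assumed number of fragment slots
        self.num_samples = num_samples  # assumed samples in the group
        self.fragments = []             # (coverage_mask, color) pairs

    def is_full(self):
        return len(self.fragments) >= self.capacity

    def add_polygon(self, mask, color, frame_buffer):
        """Store a fragment; merge first only if no slot is free.

        Returns True when a merge into the frame buffer was required.
        """
        merged = False
        if self.is_full():
            self.merge(frame_buffer)
            merged = True
        self.fragments.append((mask, color))
        return merged

    def merge(self, frame_buffer):
        """Resolve stored fragments into the frame buffer in arrival order."""
        for mask, color in self.fragments:
            for i in range(self.num_samples):
                if mask & (1 << i):
                    frame_buffer[i] = color
        self.fragments.clear()
```

In this model the frame buffer is touched only inside `merge`, so a run of polygons that fits in the available slots costs no frame buffer traffic at all.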
In step 207, when the memory location 500 is full as shown in
In one embodiment, after the information is merged into the frame buffer, the GPU 110 can recompress the color information of the pixel group and store the pixel group in a compressed form in low latency memory. This color information can be compressed using coverage masks and colors as described above. This process is illustrated in
It should be noted that if a subsequent polygon is received that completely covers all of the pixels of the group, all the samples in each pixel would be the same color and can thus be compressed 4 to 1 and stored as a single color in, for example, the top left quadrant. Although embodiments of the present invention have been described in the context of 4× multisampling, the present invention would be even more useful where higher levels of multisampling are practiced (e.g., 8× multisampling, etc.) and in applications other than anti-aliasing.
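As a hedged illustration of the 4 to 1 case, the sketch below checks whether every pixel in a group has identical samples and, if so, keeps one color per pixel. The function name `compress_4_to_1` and the flat, pixel-major sample list are assumptions made for this example only.

```python
def compress_4_to_1(samples, samples_per_pixel=4):
    """Return one color per pixel if each pixel's samples all agree.

    `samples` is a flat list ordered pixel by pixel (an assumed layout).
    Returns None when any pixel holds more than one color, in which case
    this form of compression does not apply.
    """
    if len(samples) % samples_per_pixel != 0:
        raise ValueError("sample count must be a multiple of samples_per_pixel")
    colors = []
    for start in range(0, len(samples), samples_per_pixel):
        pixel = samples[start:start + samples_per_pixel]
        if any(s != pixel[0] for s in pixel):
            return None
        colors.append(pixel[0])
    return colors
```

A group fully covered by a single polygon reduces from 16 stored samples to 4 stored colors, which is the 4 to 1 ratio referred to above.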
Additionally, it should be noted that in one embodiment, a tag value is used by the GPU 110 to keep track of the state of the memory location 500 for the group of pixels. This tag value enables the GPU 110 to keep track of the number of polygons that have been updated into the memory location 500. For example, in one embodiment, the tag value can be implemented as a 3 bit value, where, for example, tag value 0 indicates a 4 to 1 compression with one color per pixel, tag value 1 indicates 4 to 1 compression with two quadrants of the memory location 500 occupied, as shown in
0=uncompressed;
1=fully compressed, free pointer at sample 8;
2=multiple fragments, free pointer at sample 12;
3=free pointer at sample 16;
4=free pointer at sample 20;
5=free pointer at sample 24;
6=free pointer at sample 28;
7=memory location 500 full but still unresolved.
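The tag values listed above can be captured in a small lookup, sketched here for illustration. The closed form `4 * (tag + 1)` is simply an observation that fits the listed free pointer positions for tags 1 through 6; it is not stated in the description, and the names `TAG_UNCOMPRESSED`, `TAG_FULL`, and `free_pointer` are hypothetical.

```python
TAG_UNCOMPRESSED = 0  # tag 0: uncompressed
TAG_FULL = 7          # tag 7: memory location full but still unresolved


def free_pointer(tag):
    """Return the free pointer sample index for a 3 bit tag value.

    Tags 1 through 6 map to samples 8, 12, 16, 20, 24 and 28.
    Tags 0 and 7 have no free pointer in the list, so None is returned.
    """
    if not 0 <= tag <= 7:
        raise ValueError("tag must fit in 3 bits")
    if tag in (TAG_UNCOMPRESSED, TAG_FULL):
        return None
    return 4 * (tag + 1)
```

Each increment of the tag advances the free pointer by four samples, which matches the four-sample storage increments implied by the list.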
Thus, in accordance with the alternative embodiment, 16 byte writes are required, which are not necessarily more efficient than 32 byte writes but still save a read from the frame buffer. With deeper pixels or larger pixel footprints, the alternative embodiment method can still function with 3 bit tags. In the above described examples, the pixel groups comprise an eight pixel footprint. In a case where the pixel footprint comprises 16 pixel groups, the process would allocate storage in eight sample increments, or 32 byte grains. Alternatively, in a case where 8 byte pixels are being written, a 2×4 pixel group as used herein performs adequately for generating 32 byte writes.
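The write sizes mentioned above follow from simple arithmetic, sketched below. The 4 bytes per sample figure is an inference from the statement that eight sample increments correspond to 32 byte grains, and the reading of the 8 byte pixel case as four sample increments is likewise an assumption; `write_size_bytes` is a hypothetical helper.

```python
def write_size_bytes(samples_per_increment, bytes_per_sample):
    """Return the size in bytes of one storage increment."""
    return samples_per_increment * bytes_per_sample


# Four sample increments at an assumed 4 bytes per sample: 16 byte writes.
small_write = write_size_bytes(4, 4)

# Eight sample increments at an assumed 4 bytes per sample: 32 byte grains.
large_footprint_write = write_size_bytes(8, 4)

# Four sample increments at 8 bytes per sample (assumed reading): 32 bytes.
deep_pixel_write = write_size_bytes(4, 8)
```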
The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.
This application claims the benefit of U.S. Provisional Patent Application No. 60/802,746, Attorney Docket No. NVID-P002512 “DELAYED FRAME BUFFER MERGING WITH COMPRESSION”, by Alben, et al., which is incorporated herein in its entirety.
Number | Date | Country
---|---|---
60802746 | May 2006 | US