The present disclosure relates to graphics processing, and more particularly, to a method and apparatus for pre-fetching page table information in zero and/or low frame buffer applications.
Current computer applications are generally more graphically intense and demand more graphics processing power than their predecessors. Applications, such as games, typically involve complex and highly detailed graphics renderings that require a substantial amount of ongoing computation. To match consumer demands for increased graphics capabilities in computing applications, like games, computer configurations have also changed.
As computers, particularly personal computers, have been programmed to handle ever more demanding entertainment and multimedia applications, such as high definition video and the latest 3D games, higher demands have likewise been placed on system bandwidth. Thus, methods have arisen to deliver the bandwidth needed for such bandwidth-hungry applications, as well as to provide additional bandwidth headroom for future generations of applications. In addition, the structures of graphics processing units in computers have also been changing and improving in an attempt not only to keep pace, but to stay ahead as well.
Returning to northbridge 14, GPU 24 may be coupled to it via PCIe bus 25, as previously described. GPU 24 may include a local frame buffer 28, as shown in
GPU 24 may receive data from system memory 20 via northbridge 14 and PCIe buses 22 and 25, as shown in
Local frame buffer 28 may be coupled to GPU 24 for storing part or even all of the display data. Local frame buffer 28 may be configured to store information such as texture data and/or temporary pixel data, as one of ordinary skill in the art would know. GPU 24 may be configured to exchange information with local frame buffer 28 via a local data bus 29, as shown in
If local frame buffer 28 does not contain any data, GPU 24 may execute a memory read command to access system memory 20 via the northbridge 14 and data paths 22 and 25. One potential drawback in this instance is that GPU 24 may not be able to access system memory 20 with sufficient speed. As a nonlimiting example, if data paths 22 and 25 are not fast data paths, access to system memory 20 is slowed.
To access data for graphics-oriented processing from system memory 20, GPU 24 may be configured to retrieve such data from system memory 20 using a graphics address remapping table (“GART”), which may be stored in system memory 20 or in the local frame buffer 28, if available. The GART table may contain references to the physical addresses corresponding to virtual addresses.
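By way of illustration only, the following minimal C sketch models the remapping that such a table performs, assuming 4 KB pages; the type and function names (gart_table_t, gart_translate) are hypothetical and are not taken from this disclosure.

```c
#include <stdint.h>

#define PAGE_SHIFT       12                          /* 4 KB pages */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

/* Hypothetical GART: entry i holds the physical base of logical page i. */
typedef struct {
    uint64_t *phys_page;    /* page-aligned physical base addresses */
    uint32_t  num_entries;
} gart_table_t;

/* Remap a logical address to its physical counterpart: scattered 4 KB
 * physical pages thereby appear as one contiguous logical address space. */
static uint64_t gart_translate(const gart_table_t *gart, uint64_t logical)
{
    uint64_t page   = logical >> PAGE_SHIFT;         /* which 4 KB page */
    uint64_t offset = logical & PAGE_OFFSET_MASK;    /* offset in page  */
    if (page >= gart->num_entries)
        return 0;                                    /* out of range    */
    return gart->phys_page[page] | offset;           /* base is aligned */
}
```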
If the local frame buffer 28 is unavailable, the GART table is thus stored in system memory 20. Thus, GPU 24 may execute a first retrieval operation to access the GART table in system memory 20 so as to determine the physical address for data stored in system memory 20. Upon receiving this information, GPU 24 may thereafter retrieve the data at that physical address in a second retrieval operation. Therefore, if local frame buffer 28 is too small to store the GART table or is nonexistent, GPU 24 may rely more heavily on system memory 20, and therefore suffer increased latency resulting from multiple memory access operations.
Thus, to utilize a display with system memory 20, three basic configurations may be utilized. The first is a contiguous memory address implementation, which may be accomplished by using the GART table, as described above. With the GART table, GPU 24 may be able to map various non-contiguous 4 KB physical pages in system memory 20 into a larger contiguous logical address space for display or rendering purposes. As many graphic card systems, such as the computer system 10 in
In a graphics system wherein the local frame buffer 28 is a sufficiently sized memory, the GART table may actually reside in the local frame buffer 28, as described above. The local data bus 29 may therefore be used to fetch the GART table from the local frame buffer 28 so that address remapping may be performed by the display controller of GPU 24.
The read latency to the display in this instance (wherein the GART table is contained in local frame buffer 28) may be the sum of the local frame buffer 28 read time and the time spent on the translation process. Since access to local frame buffer 28 may typically be relatively fast compared to access to system memory 20, as described above, the GART table fetch itself does not greatly increase read latency in this instance.
However, when there is no local frame buffer 28 in computing system 10, the GART table may be located in system memory 20, as also described above. Therefore, in order to perform a page translation (of a virtual address to a physical address), a table request may first be issued by a bus interface unit of GPU 24. The display read address may then be translated, and a second read for the display data itself may then be issued. In this case, a single display read is realized as two bus interface unit system memory reads. Stated another way, read latency to the display controller of GPU 24 is doubled, which may not be acceptable for graphical processing operations.
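To make the doubling concrete, the following toy C program compares the two cases; the cycle counts are assumptions chosen purely for illustration and are not figures from this disclosure.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed, purely illustrative access times in arbitrary cycles. */
    unsigned t_sysmem = 200;   /* one system memory read         */
    unsigned t_local  = 20;    /* one local frame buffer read    */
    unsigned t_xlate  = 2;     /* the address translation itself */

    /* GART in local frame buffer 28: a fast local table read, the
     * translation, then one system memory read for the display data. */
    unsigned with_fb = t_local + t_xlate + t_sysmem;

    /* GART in system memory 20: the table read and the data read are
     * both system memory reads, so display read latency roughly doubles. */
    unsigned no_fb = t_sysmem + t_xlate + t_sysmem;

    printf("GART in local frame buffer: %u cycles\n", with_fb);  /* 222 */
    printf("GART in system memory:      %u cycles\n", no_fb);    /* 402 */
    return 0;
}
```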
Therefore, a heretofore unaddressed need exists to address the aforementioned deficiencies and shortcomings.
A method for a graphics processing unit (“GPU”) of a computer to maintain a local cache to minimize system memory reads is provided. The GPU may have a relatively small local frame buffer or may lack a local frame buffer completely. In either instance, the GPU may be configured to maintain a local cache of physical addresses for the display lines being executed so as to reduce the instances in which the GPU attempts to access the system memory.
Graphics-related software may cause a display read request and a logical address to be received by the GPU. In one nonlimiting example, the display read request and logical address may be received by a display controller in a bus interface unit (“BIU”) of the GPU. A determination may be made as to whether a local cache contains a physical address corresponding to the logical address received with the display read request. A hit/miss component in the BIU may make this determination.
If the hit/miss component determines that the local cache does contain the physical address corresponding to the received logical address, the result may be recognized as a “hit.” In that instance, the logical address may thereafter be converted to its physical address counterpart. The converted physical address may be forwarded by a controller to the system memory of the computer to access the addressed data. A northbridge may be positioned between the GPU and the system memory to route communications therebetween.
However, if the hit/miss component determines that the local cache does not contain the physical address corresponding to the received logical address, a “miss” result may be recognized. In that instance, a miss pre-fetch component in the BIU may be configured to retrieve a predetermined number of cache pages from a table, such as a GART table in one nonlimiting example, stored in the system memory. In one nonlimiting example, a programmable register may control the predetermined number of cache pages (or lines) retrieved from the table. In an additional nonlimiting example, the predetermined number of cache pages retrieved may correspond to the number of pixels in one line of a display coupled to the GPU.
When the hit/miss test component determines that the local cache does contain the physical address corresponding to the received logical address, an additional evaluation may be made as to whether the number of cache pages remaining in the local cache is becoming low. If so, a hit pre-fetch component may generate a next cache page request, or the like, to retrieve a next available cache page from the table (i.e., the GART table) in the system memory so as to replenish the cache pages contained in the local cache. In this manner, the local cache may be configured to maintain a position that is sufficiently ahead of the current processing position of the GPU.
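As a rough software model of the behavior summarized above, the following self-contained C sketch walks a sequential display scan through a toy page table cache: a “miss” bulk-fetches a register-programmed number of page addresses, while a “hit” with few pages remaining tops the cache up by one. All names and constants are illustrative assumptions rather than elements of this disclosure.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_PAGES 64   /* pages modeled in this toy example      */
#define PREFETCH  16   /* assumed register value: pages per miss */
#define WATERMARK 4    /* assumed "running low" threshold        */

/* Toy GART in "system memory": logical page i maps to a physical page. */
static uint64_t gart[NUM_PAGES];

/* Local page table cache, filled in order for a sequential display scan. */
static bool     cached[NUM_PAGES];
static uint64_t cache[NUM_PAGES];
static unsigned next_fetch;               /* next GART entry not yet cached */

static void fetch_from_gart(unsigned n)   /* models a system memory read */
{
    while (n-- > 0 && next_fetch < NUM_PAGES) {
        cache[next_fetch]  = gart[next_fetch];
        cached[next_fetch] = true;
        next_fetch++;
    }
}

static uint64_t display_read(unsigned page)
{
    if (!cached[page])                        /* "miss": bulk pre-fetch    */
        fetch_from_gart(PREFETCH);
    else if (next_fetch - page < WATERMARK)   /* "hit" but running low     */
        fetch_from_gart(1);                   /* top up by one entry       */
    return cache[page];                       /* translated physical page  */
}

int main(void)
{
    for (unsigned i = 0; i < NUM_PAGES; i++)  /* scattered physical pages  */
        gart[i] = 0x80000000u + 0x1000u * (NUM_PAGES - i);

    printf("page 0 -> %#llx\n", (unsigned long long)display_read(0)); /* miss */
    printf("page 1 -> %#llx\n", (unsigned long long)display_read(1)); /* hit  */
    return 0;
}
```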
This configuration enables the GPU to minimize the number of miss determinations, thereby increasing the efficiency of the GPU. Efficiency is increased because the GPU does not have to repeatedly retrieve both the cache pages containing physical addresses and the data itself from the system memory. Retrieving the cache page containing the physical addresses and thereafter the addressed data constitutes two separate system memory access operations and is slower than accessing the system memory just once. Instead, by attempting to ensure that the physical addresses for received logical addresses are contained in the local cache, the GPU accesses system memory once, for actual data retrieval purposes, thereby operating more efficiently.
As described above, the GPU 24 of
Accordingly, GPU 24 may include a bus interface unit 30 that is configured to receive and send data and instructions. In one embodiment among others, the bus interface unit 30 may include a display read address translation component 31 configured to minimize access of system memory 20, as described above. The display read address translation component 31 of
In the nonlimiting example shown in
The components of the display read address translation component 31 may include a display read controller 32 that communicates with a page table cache (or local cache) 34. The page table cache 34 may be configured to store up to, as a nonlimiting example, one entire display line of pages in tile format. A programmable register (not shown) may be used to set the size of the single display line depending upon the resolution of the display, thereby adjusting the amount of data that may be stored in the page table cache 34. As one nonlimiting example, the register bits that control the size of page table cache 34 may be implemented to correspond to the number of 8-tile cache lines needed to complete a display line.
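The sizing itself is simple arithmetic. The short C sketch below assumes, consistent with the 8-tile grouping just described, that each cache line groups eight page table entries; the page count used in the example is hypothetical.

```c
#include <stdio.h>

/* Each cache line is assumed to group eight page table entries,
 * matching the 8-tile grouping described above. */
#define ENTRIES_PER_CACHE_LINE 8

/* Round up so a partial group still occupies a whole cache line. */
static unsigned cache_lines_for_display_line(unsigned pages_per_display_line)
{
    return (pages_per_display_line + ENTRIES_PER_CACHE_LINE - 1)
           / ENTRIES_PER_CACHE_LINE;
}

int main(void)
{
    /* Assumed example: a tiled display line touching 32 distinct 4 KB pages. */
    printf("%u cache lines\n", cache_lines_for_display_line(32));  /* 4 */
    return 0;
}
```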
Thus, in the process 50 of
In following the “miss” branch, step 56 follows, wherein hit/miss test component 38 prompts the miss pre-fetch component 41, which operates to generate a cache request fetch command in this instance. The cache request is generated to retrieve the physical address corresponding to the received logical address. In step 58, the generated cache request fetch command is forwarded from the miss pre-fetch component 41 via the demultiplexer 44 to the northbridge 14, and on to the system memory 20.
At the system memory 20, the GART table stored therein is accessed such that the cache data associated with the fetch command is retrieved and returned to GPU 24. More specifically, as shown in step 62, the cache request fetch command results in a number of cache lines being fetched from the GART table corresponding to the value of a programmable register, as described above. As one nonlimiting example, the register may be configured so that the page table cache 34 retains and maintains an entire display line for a display coupled to GPU 24.
Upon receiving the fetched cache lines from the GART table in system memory 20, the fetched cache lines are stored in the page table cache 34. Thereafter, in step 64, the display read controller translates the logical address associated with the fetched cache lines to its physical address in the local cache via hit/miss component 38. Thereafter, the physical address, as translated in step 64 by the hit pre-fetch component 42, is output by the demultiplexer 44 via northbridge 14 to access the addressed data stored in system memory 20 and corresponding to the translated physical address.
Steps 64 and 66 of the process 50 of
As stated above, the predetermined number of cache lines initially fetched in steps 56, 58, and 62 may be prescribed by a programmable register. Thus, an initial single-page “miss” may result in an entire display line of page addresses being retrieved and stored in the page table cache 34. However, with each subsequent hit/miss test performed in step 54, the “hits” may outweigh the “misses,” which may result in fewer accesses of system memory 20.
Once all of the data contained in cache line 0 in
As stated above, completion of cache line 0 moves the display read controller to cache line 1, but also triggers the pre-fetching of cache line 4 (signified by the diagonal arrow extending from cache line 1 to cache line 4). Similarly, upon completion of cache line 1, the display read controller 32 may move to cache line 2, thereby resulting in the pre-fetching of cache line 5, as signified by the diagonal arrow extending from cache line 2 to cache line 5. In this way, the page table cache 34 stays ahead of display read controller 32, maintaining an additional display line of data so as to minimize the double retrieval by GPU 24 of both the physical address and then the data associated with that address.
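The sliding-window behavior may be sketched in a few lines of C; the pre-fetch distance of three is chosen only to reproduce the “finish cache line 0, move to cache line 1, pre-fetch cache line 4” pattern above, and the function names are illustrative.

```c
#include <stdio.h>

/* A pre-fetch distance of three reproduces the pattern above: finishing
 * cache line 0 moves the read position to line 1 and pre-fetches line 4. */
#define PREFETCH_DISTANCE 3

static void prefetch_cache_line(unsigned line)   /* stand-in for a GART fetch */
{
    printf("pre-fetching cache line %u from the GART table\n", line);
}

/* Called when the current cache line has been fully consumed. */
static unsigned advance_display_read(unsigned current_line)
{
    unsigned next = current_line + 1;              /* move to the next line */
    prefetch_cache_line(next + PREFETCH_DISTANCE); /* keep the window ahead */
    return next;
}

int main(void)
{
    unsigned line = 0;
    line = advance_display_read(line);   /* now at line 1, pre-fetches line 4 */
    line = advance_display_read(line);   /* now at line 2, pre-fetches line 5 */
    return 0;
}
```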
Returning to
Nevertheless, when cache line 0, in this nonlimiting example, is consumed (all data utilized), the result of decision step 72 may be “yes,” such that the display read controller 32 moves to the next cache line (cache line 1) stored in the page table cache 34. Thereafter, in step 74, a next cache request command is generated by hit pre-fetch component 42 so as to pre-fetch the next cache line. The hit pre-fetch component 42 forwards the next cache request command via demultiplexer 44 in BIU 30 of GPU 24 on to northbridge 14 and the GART table stored in system memory 20.
The next cache line, which may be cache line 4 in this nonlimiting example, is retrieved from the GART table in system memory 20. Cache line 4 is returned for storage in the page table cache 34. Thus, as described above, the diagonal arrows shown in
In continuing with this nonlimiting example, upon an initial “miss” from decision step 54 of
Subsequently, upon each “hit” in step 54, a determination may thereafter be made in step 72 (by hit/miss component 38) as to whether an additional cache line should be fetched from the GART table in system memory 20. If so, the hit pre-fetch component 42 may fetch one additional cache line, as shown in steps 74, 76, and 78. Thus, the page table cache 34 may always retain, in this nonlimiting example, a prescribed number of physical addresses locally, thereby staying ahead of processing and minimizing the number of double data-fetching operations, which slow processing.
It should be emphasized that the above-described embodiments and nonlimiting examples are merely possible examples of implementations, merely set forth for a clear understanding of the principles disclosed herein. Many variations and modifications may be made to the above-described embodiment(s) and nonlimiting examples without departing substantially from the spirit and principles disclosed herein. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.