The accompanying drawings, which are incorporated in and constitute part of the specification, illustrate embodiments of the invention and, together with the general description given above and the detailed description of the embodiment given below, serve to explain the principles of the present invention. In the drawings:
The following detailed description of the present invention refers to the accompanying drawings that illustrate exemplary embodiments consistent with this invention. Other embodiments are possible, and modifications may be made to the embodiments within the spirit and scope of the invention. Therefore, the detailed description is not meant to limit the invention. Rather, the scope of the invention is defined by the appended claims.
It would be apparent to one of skill in the art that the present invention, as described below, may be implemented in many different embodiments of software, hardware, firmware, and/or the entities illustrated in the figures. Any actual software code with the specialized control of hardware to implement the present invention is not limiting of the present invention. Thus, the operational behavior of the present invention will be described with the understanding that modifications and variations of the embodiments are possible, given the level of detail presented herein.
One of the challenges associated with traditional computer systems, such as the system 100, is that maintaining separate memories for the CPU 102 and the GPU 108 increases overall system cost. An additional consideration in laptop computers, for example, is that separate memories require more battery power. Therefore, from a cost and power savings perspective, a more efficient memory configuration is one in which a single system memory, which consumes less power than multiple memories, is shared between the CPU and the GPU.
Most modern source material, such as high definition (HD) video, is data intensive, thereby requiring the use of significant amounts of memory. When available communications channels between the processor and memory, such as the communications path 109, are bandwidth limited, this HD video material cannot be successfully viewed. For example, the communications path 109 may be so constrained that sufficient amounts of data cannot travel fast enough to update the display 112 during an HD video presentation. This issue arises because essentially all data must travel back and forth between the GPU 108 and the DRAM 116, across the communications path 109.
For example, when a standard graphics operation is performed within the GPU 108, data must first be read from the memory 116. This data must travel from the memory 116, across the communications path 109, to the GPU 108. The GPU 108 then operates upon or manipulates the data, and then returns the data across the communications path 109 for storage in the memory 116. This continuing bi-directional movement of data between the GPU 108 and the single system memory 116 is necessary because the GPU 108 does not have its own dedicated frame buffer. Thus, the system 114 suffers in performance due to the constraints of the communications path 109.
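The bandwidth pressure described above can be illustrated with a rough, back-of-envelope calculation. The frame sizes, refresh rate, link speed, and pass counts below are illustrative assumptions, not values taken from this specification:

```python
# Rough, illustrative estimate of link traffic for uncompressed frame data.
# All figures are assumptions made for the sake of the sketch.

BYTES_PER_PIXEL = 4          # assume 32-bit RGBA pixels
FPS = 60                     # assume a 60 Hz display refresh rate
LINK_GB_PER_S = 3.0          # assumed nominal one-way link bandwidth

def frame_traffic_gb_per_s(width, height, passes=2):
    """Link traffic if each frame is read across the link and written back.

    passes=2 models the single read-plus-write round trip described above;
    higher values model several GPU operations touching each frame.
    """
    bytes_per_frame = width * height * BYTES_PER_PIXEL
    return bytes_per_frame * FPS * passes / 1e9

# A single read/write pair for 1080p60 already consumes ~1 GB/s of link
# bandwidth; several operations per frame can exceed the assumed 3 GB/s link.
single_pass_pair = frame_traffic_gb_per_s(1920, 1080)
many_operations = frame_traffic_gb_per_s(1920, 1080, passes=8)
print(single_pass_pair, many_operations)
```

Under these assumptions, even a modest number of per-frame operations saturates the link, which is consistent with the failure mode described for real-time video.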
The bandwidth of the communications path 109 is essentially fixed, typically at about 3 gigabytes per second in each direction. Absolute bandwidth values will rise and fall over time, but so will demand (i.e., faster operation, more complex processing, higher resolutions). This rise and fall of bandwidth supply and demand averages out to the equivalent of the fixed bandwidth value discussed above.
This fixed bandwidth value is established by the form factor of the PC and is an industry standard. As understood by those of skill in the art, this industry standard is provided to standardize the connectivity of plug-and-play modules. Although the PC industry is trending towards a wider bandwidth standard, today's standard imposes significant throughput constraints.
Because of the throughput constraints between the GPU 108 and the north bridge 106, the ability of the GPU 108 to perform specific video functions becomes constrained. That is, certain graphics functions within the GPU 108 simply cannot be accomplished due to the throughput constraints of the communications channel 109.
For example, general graphics functions (i.e., 3D operations) will typically continue to function correctly, but may have degraded performance (i.e., games will be sluggish). Video processing, however, requires real-time updates and can therefore fail. A latency issue also exists. That is, because a single system memory is used, memory data may be farther from the GPU. Therefore, instances can arise in which the GPU stalls while waiting for the data. This stalling, or latency, is especially problematic for display data and can also impact general system performance.
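The stall described above can be sketched as a toy timing model. The latency figures and the idea of a fixed per-request "latency-hiding budget" are invented for illustration and do not come from this specification:

```python
# Toy illustration of a GPU stall: if a memory fetch takes longer than the
# latency the GPU can hide with other work, the GPU idles for the difference.
# The microsecond figures below are hypothetical.

def stall_us(fetch_latency_us, hideable_latency_us):
    """Microseconds the GPU spends idle waiting on a memory fetch."""
    return max(0.0, fetch_latency_us - hideable_latency_us)

# A distant single system memory (long fetch) produces a stall; a fetch
# that completes within the hideable window produces none.
print(stall_us(fetch_latency_us=120.0, hideable_latency_us=100.0))
print(stall_us(fetch_latency_us=40.0, hideable_latency_us=100.0))
```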
Although conventional techniques exist that attempt to limit the performance impact of longer latencies, these techniques add cost to the GPU and are not particularly effective. One such technique known in the art is the use of an integrated graphics device. These integrated graphics devices, however, are typically optimized to minimize cost. In many cases, because cost is the primary concern, performance and efficiency suffer. Therefore, a more efficient technique is needed to optimize the flow of data within the computer system 114.
In addition to the components noted above, the computer system 200 also includes a single system memory, such as a DRAM 209. Although in the embodiment of
The GPU 204 and the north bridge 207 each includes predetermined functional modules that are configured to perform specific operations upon the data. Application drivers (not shown), executed by the CPU 202, can be programmed to dynamically control which functional modules are to be enabled within, or partitioned between, the GPU 204 and the north bridge 207. Within this framework, a user can determine, for example, that support functionality modules will be enabled within the north bridge 207 and graphics functionality modules will be enabled within the GPU 204. As a practical matter, the functions distributed between the GPU 204 and the north bridge 207 in the computer system 200 can be combined into a single integrated circuit (IC). Better performance, however, is achieved within the computer system 200 by dividing the functions across separate ICs.
Fundamentally, the ability to redistribute functions between the GPU 204 and the north bridge 207 is based upon the fact that data processing functions operate as memory-to-memory operations. That is, input data is read from a memory, such as the DRAM 209, and processed by a functional module, discussed in greater detail below. The resulting output data is then written back to the DRAM 209. In the present invention, therefore, whenever a functional module operates upon a specific portion of data within the north bridge 207 rather than in the GPU 204, this portion of data is no longer required to travel from the north bridge 207 to the GPU 204, and back. Stated another way, since this portion of data is processed within the north bridge 207, it no longer needs to travel across the first communications path 208.
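The memory-to-memory observation above can be sketched as a simple traffic model. The function names, data sizes, and device labels below are hypothetical placeholders, not elements of the disclosed system:

```python
# Toy model of the memory-to-memory data flow described above.  A function
# placed on the north bridge reads from and writes to DRAM locally, so its
# data never crosses the GPU link; a function placed on the GPU must pull
# its input across the link and push its output back.

def link_traffic_bytes(placement, data_sizes):
    """Bytes crossing the GPU <-> north-bridge link for one pass of each function."""
    total = 0
    for func, device in placement.items():
        if device == "gpu":
            total += 2 * data_sizes[func]   # read across the link, write back
    return total

# Hypothetical per-pass data sizes for two functions.
sizes = {"graphics_core": 8_000_000, "video_decode": 4_000_000}

all_on_gpu = link_traffic_bytes(
    {"graphics_core": "gpu", "video_decode": "gpu"}, sizes)
split = link_traffic_bytes(
    {"graphics_core": "gpu", "video_decode": "north_bridge"}, sizes)
print(all_on_gpu, split)  # moving a function into the bridge removes its traffic
```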
The first communications path 208 is also representative of a virtual channel formed between the GPU 204 and the north bridge 207. That is, the first communications path 208 can be logically divided into multiple virtual channels. A virtual channel is used to provide dedicated resources or priorities to a set of transactions or functions. By way of example, a virtual channel can be created and dedicated to display traffic. Display traffic is critical because the display screen 206 is desirably refreshed about 60 or more times per second. If the display data is late, the displayed images can be corrupted or may flicker. Using a virtual channel helps provide dedicated bandwidth and latency for display traffic.
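One way to picture virtual-channel prioritization is a strict-priority arbiter over per-channel queues. The channel names and the strict-priority policy below are illustrative assumptions; real link arbitration schemes vary:

```python
from collections import deque

# Minimal sketch of virtual-channel arbitration on a shared link.  Each
# virtual channel gets its own queue; channels listed earlier have strict
# priority, so display traffic is drained before bulk traffic.

class Link:
    def __init__(self, channels):
        # One queue per virtual channel, in priority order.
        self.queues = {name: deque() for name in channels}

    def submit(self, channel, transaction):
        self.queues[channel].append(transaction)

    def next_transaction(self):
        # Serve the highest-priority non-empty channel first, so display
        # refresh deadlines are met even when bulk traffic is queued.
        for queue in self.queues.values():
            if queue:
                return queue.popleft()
        return None

link = Link(["display", "bulk"])
link.submit("bulk", "texture_upload")
link.submit("display", "scanline_0")
print(link.next_transaction())  # display traffic goes first despite arriving later
```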
Also in the computer system 200, a second communications path 210 provides an interface between the north bridge 207 and the DRAM 209. As noted above, the north bridge 207 and the GPU 204 each includes functional modules configured to perform predetermined functions on data stored within the DRAM 209. The specific types of functions performed by each of the functional modules within the GPU 204 and the north bridge 207 are not significant to operation of the present invention. However, for purposes of illustration, specific functions and functional modules are provided within the computer system 200, as illustrated in
For example, functional modules included within the GPU 204 include a graphics core (GC) 212 for performing 3-dimensional graphics functions. A peripheral component interconnect express (PCIE) interface 214 is used to decode protocols for data traveling from the north bridge 207 to a standard memory controller (MC) 216, within the GPU 204. A display block 218 is used to push data, processed within the GPU 204, out to the display screen 206. A frame buffer compression (FBC) module 220 is provided to reduce the number of internal memory accesses in order to conserve system power. In the exemplary embodiment of
Similar functional modules are included within the north bridge 207 and operate essentially the same as those included within the GPU 204. Thus, the description of these similar functional modules will not be repeated. A memory controller 224 and a PCIE interface 226 are provided to encode data traveling from the north bridge 207 to the GPU 204. In the embodiment of
As discussed above, the present invention optimizes the flow of data between the GPU 204 and the north bridge 207 by redistributing its flow. For example, assume that an instruction has been forwarded via the CPU 202 to perform a graphics core function upon data stored within the DRAM 209. In a conventional computer system arrangement, the graphics core function might be performed within the GPU 204. In the present invention, however, an a priori determination can be made to enable the GC function within the north bridge 207 instead of the GPU 204.
Performing the processing within the north bridge 207 will likely require less power, since the data need not be passed through the north bridge 207, across the communications path 208, and into the GPU 204. High bandwidth links consume relatively large amounts of power. If the computer system 200 can be configured to require less bandwidth, the communication links, such as the communications path 208, can be placed into a lower power state for greater periods of time. This lower power state helps conserve power.
The a priori determination to enable the GC function within the north bridge 207 instead of the GPU 204 can be implemented by configuring associated drivers executed by the CPU 202 using techniques known to those of skill in the art. In this manner, whenever the GC function is required, data will be extracted from the DRAM 209, processed within the GC functional module within the north bridge 207, and then stored back into the DRAM 209. Data processing within the north bridge 207 precludes the need for shipping the data across the communications path 208, thus preserving the use of this path for other system functions.
For highest performance, as an example, the computer system 200 can be configured to use all functional modules in both the GPU 204 and the north bridge 207 simultaneously. Configuring the computer system 200 in this manner requires balancing bandwidth and latency requirements. For example, processing-intensive tasks that require relatively low bandwidth can be placed on the GPU 204, while low-latency tasks that require higher bandwidth can be placed on the north bridge 207.
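The balancing rule above can be sketched as a simple placement heuristic. The thresholds and parameter names are invented for illustration; an actual driver would base such decisions on its own profiling data:

```python
# Hypothetical placement heuristic following the balancing rule above:
# low-latency or high-bandwidth tasks go to the north bridge (close to
# DRAM), while processing-intensive, low-bandwidth tasks go to the GPU.
# The threshold values below are made up.

def place_task(bandwidth_gb_s, latency_budget_us,
               bw_threshold_gb_s=1.0, latency_threshold_us=50.0):
    """Return the device on which to enable a task's functional module."""
    if (latency_budget_us < latency_threshold_us
            or bandwidth_gb_s > bw_threshold_gb_s):
        return "north_bridge"
    return "gpu"

# A display-refresh-like task (tight latency, heavy traffic) lands on the
# north bridge; a 3D-rendering-like task (compute-heavy, tolerant of
# latency, modest traffic) lands on the GPU.
print(place_task(bandwidth_gb_s=2.5, latency_budget_us=10.0))
print(place_task(bandwidth_gb_s=0.5, latency_budget_us=5000.0))
```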
By way of illustration, when the computer system 200 is placed in operation, a user can be presented via the display screen 206 with an option of selecting an enhanced graphics mode. Typical industry names for enhanced graphics modes include, for example, extended 3D, turbo graphics, or some other similar name. When the user selects this enhanced graphics mode, the drivers executed by the CPU 202 are automatically configured to optimize the flow of data between the north bridge 207 and the GPU 204, thus enabling the graphics enhancements.
More specifically, when the enhanced graphics mode is selected by the user, the drivers executed by the CPU 202 dynamically configure the functional modules within the north bridge 207 and the GPU 204 to maximize the number of data processing functions performed within the north bridge 207. This dynamically configured arrangement minimizes the amount of data required to travel across the communications path 208. In so doing, bandwidth of the communications path 208 is preserved and its throughput is maximized.
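The mode switch described above can be sketched as a configuration function. The function names and the set of bridge-capable modules are hypothetical; the point is only that enabling the mode moves every bridge-capable function off the GPU:

```python
# Illustrative sketch of the enhanced-graphics-mode switch: when the mode
# is enabled, every function the north bridge is capable of hosting is
# placed there, minimizing link traffic.  All names are hypothetical.

def configure(functions, nb_capable, enhanced_mode):
    """Map each function to the device on which its module is enabled."""
    placement = {}
    for func in functions:
        if enhanced_mode and func in nb_capable:
            placement[func] = "north_bridge"
        else:
            placement[func] = "gpu"
    return placement

funcs = ["graphics_core", "video_decode", "display"]
bridge_capable = {"graphics_core", "video_decode"}

normal = configure(funcs, bridge_capable, enhanced_mode=False)
enhanced = configure(funcs, bridge_capable, enhanced_mode=True)
print(normal)
print(enhanced)  # display remains on the GPU; the rest move to the bridge
```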
The computer system 300, among other things, addresses a separate real-time constraint issue related to display screen data refresh. That is, typical computer system displays are refreshed at a rate of at least 60 times per second, as noted above. Therefore, if the display data cannot travel across the communications path 208 in a manner supportive of this refresh rate, images being displayed on the display screen 206 can become distorted or flicker. Thus, the embodiment of
Correspondingly, as discussed above in relation to
The present invention provides a technique and a computer system to reduce the throughput constraints imposed by a communications path between a bridging device and a GPU. By carefully partitioning functionality and/or functional modules between the GPU and the bridging device, the need for certain data to travel across a narrow communications path between the GPU and the bridging device can be eliminated, thus increasing overall system throughput.
The present invention has been described above with the aid of functional building blocks illustrating the performance of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.