1. Field of the Invention
Embodiments of the present invention relate generally to the field of computing devices and more specifically to a technique for reducing power consumed during frame updates through compression and local storage of display and cursor data.
2. Description of the Related Art
High performance mobile computing devices typically include high performance microprocessors and graphics adapters as well as large main memories. Since each of these components consumes considerable power, the battery life of a high performance mobile computing device is usually quite short. For many users, battery life is an important consideration when deciding which mobile computing device to purchase. Thus, longer battery life is something that sellers of high performance mobile computing devices desire.
As mentioned, the graphics adapters found in most high performance mobile computing devices consume considerable power, even when performing tasks like refreshing the screen for display. For example, a typical graphics adapter may refresh the screen twenty to sixty times per second. For each screen refresh, the graphics adapter usually reads several blocks of display data stored in main memory, creates a frame from this display data, and then transmits the frame for display. Transmitting the read requests from the graphics adapter to the main memory consumes power, reading the blocks of display data from main memory consumes power, and creating the frame as well as transmitting the frame for display consumes power. Further, this sequence of events usually involves several intermediate logic blocks, such as a bus controller and a memory controller, each of which also consumes power.
Refreshing the screen begins with display logic 114 requesting arbitration logic 116 to read some or all screen addresses, defined by line and pixel coordinates, from the display data 138 in the main memory 106. This request causes arbitration logic 116 to schedule a read operation. Arbitration logic 116 prioritizes all outstanding read and write requests within the FB UMA 110 and transmits requests to unrolling logic 118 in order of priority. For example, since display logic 114 uses the current display data 138 to refresh the screen within a fixed time period (e.g., one-twentieth to one-sixtieth of a second), read operations contributing to screen refresh are assigned a high priority by arbitration logic 116 based on that fixed time constraint. Alternatively, other read or write operations that are not under timing constraints are assigned a lower priority by arbitration logic 116.
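By way of illustration only, the prioritization performed by arbitration logic 116 may be sketched as a two-level priority queue. The class name `Arbiter`, its methods, and the use of exactly two priority levels are assumptions made for this sketch and are not taken from the described hardware.

```python
import heapq
import itertools


class Arbiter:
    """Hypothetical model: time-critical requests are served before all others."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # preserves arrival order within a level

    def submit(self, name, time_critical):
        # Assumption: 0 = high priority (screen refresh), 1 = everything else.
        priority = 0 if time_critical else 1
        heapq.heappush(self._heap, (priority, next(self._sequence), name))

    def next_request(self):
        # Pop the highest-priority, earliest-submitted request.
        return heapq.heappop(self._heap)[2]


arbiter = Arbiter()
arbiter.submit("texture write", time_critical=False)
arbiter.submit("display refresh read", time_critical=True)
first = arbiter.next_request()
second = arbiter.next_request()
```

In this sketch the refresh read is served first even though it was submitted second, which mirrors the fixed time constraint described above.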
Once arbitration logic 116 prioritizes and transmits the high priority read operation through the interface 122 to unrolling logic 118, control logic 115 directs unrolling logic 118 to unroll the read operation into a series of smaller (e.g., 64B) read operations that are small enough for the HT bus 108 to perform in a single bus transaction. In a subsequent step of the overall read operation, the results of these smaller read operations are combined into the single, contiguous and ordered data block originally requested by display logic 114. For example, if display logic 114 requests control logic 115 to perform a high priority read operation of pixels from the cursor and display data 138, and arbitration logic 116 transmits that operation to unrolling logic 118, unrolling logic 118 will unroll the pixel read operation into a series of smaller read operations.
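By way of illustration only, the unrolling of one large read operation into bus-sized read operations may be sketched as follows. The function name `unroll_read` and the byte-addressed `(address, size)` representation are assumptions of this sketch; the 64-byte default follows the example size given above.

```python
def unroll_read(start, length, chunk=64):
    """Split the byte range [start, start + length) into reads of at most `chunk` bytes.

    Hypothetical helper: returns a list of (address, size) pairs in address order.
    """
    reads = []
    offset = start
    end = start + length
    while offset < end:
        size = min(chunk, end - offset)  # the final read may be shorter
        reads.append((offset, size))
        offset += size
    return reads


smaller_reads = unroll_read(0, 160)
```

Here a 160-byte read operation yields two 64-byte read operations followed by one 32-byte read operation.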
After unrolling logic 118 unrolls the read operation into smaller read operations, control logic 115 directs unrolling logic 118 to transmit those smaller read operations through the interface 124 to tiling logic 120. Control logic 115 then directs tiling logic 120 to determine the physical memory address for each smaller read operation based on the screen address associated with the smaller read operation initially requested by display logic 114. Control logic 115 also directs tiling logic 120 to transmit each smaller read operation with its corresponding physical address through the interface 126 to the FPCI 112.
For each smaller read operation received by the FPCI 112, the FPCI 112 transmits a read request to the memory controller 134 within the microprocessor 104 through the interface 130, the HT bus 108 and the interface 132. However, if the HT bus 108 is in power savings mode before the FPCI 112 transmits the read request to the memory controller 134, the FPCI 112 brings the HT bus 108 out of power savings mode before transmitting the request. Once one or more read requests are transmitted to the memory controller 134, the memory controller 134 reads the requested data from the main memory 106 through memory interface 136 and transmits the data to the FPCI 112. As is well-known, the memory controller 134 frequently transmits the data back to the FPCI 112 out-of-order relative to the order of read requests transmitted by the FPCI 112 to the memory controller 134. Since display logic 114 expects contiguous and ordered display data to create the frame properly, the FPCI 112 reorders and combines the smaller blocks of data received from the memory controller 134 into a single, contiguous and ordered data block that is transmitted through the interface 128 to display logic 114, which then creates the frame accordingly.
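By way of illustration only, the reordering and combining performed by the FPCI 112 may be sketched as follows. The function name `reorder_and_combine` and the representation of a response as an `(address, data)` pair are assumptions of this sketch.

```python
def reorder_and_combine(request_addresses, responses):
    """Rebuild one contiguous, ordered block from responses that arrive out of order.

    request_addresses: addresses in the order the read requests were issued.
    responses: iterable of (address, data) pairs in arbitrary arrival order.
    """
    data_by_address = dict(responses)
    # Concatenate the returned data in request order, not arrival order.
    return b"".join(data_by_address[address] for address in request_addresses)


block = reorder_and_combine(
    [0, 64, 128],
    [(128, b"CC"), (0, b"AA"), (64, b"BB")],
)
```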
As previously described, one drawback of the foregoing process is that read operations between the GPU 102 and the main memory 106 may consume substantial power, which can reduce the battery life for mobile computing devices. More specifically, each read operation consumes power due to transmitting a read request from the FPCI 112 to the memory controller 134 through the HT bus 108 and transmitting a read response from the memory controller 134 to the FPCI 112 through the HT bus 108. Additionally, if either the HT bus 108 or memory controller 134 is in power savings mode before transmitting a request or response, bringing the HT bus 108 or the memory controller 134 out of power savings mode consumes additional power. Further, as is commonly known, reading display data from the main memory 106 consumes substantial power both in the main memory 106 and in the memory controller 134. Thus, over the course of many screen refreshes, substantial battery power is consumed.
As the foregoing illustrates, what is needed in the art is a way to reduce the amount of battery power consumed by a mobile computing device when refreshing the screen.
One embodiment of the present invention sets forth a method for configuring a graphics processing unit to refresh a screen display using data stored in a local memory and/or a main memory. The method includes the steps of setting a threshold limit in a threshold counter for determining whether cursor data and display data may be preferentially stored in the local memory but also may be stored in the main memory, configuring control logic within the graphics processing unit to read cursor data and display data from only the main memory, reading cursor data and display data related to a first frame from the main memory, and creating the first frame using the cursor data and the display data read from the main memory. The method also includes the steps of determining whether the first frame is different than a previously created frame, and adjusting a count of the threshold counter based on whether the first frame is different than the previously created frame.
Another embodiment of the present invention sets forth a method for reading display data from the local memory coupled to the graphics processing unit or from the main memory. The method includes the steps of receiving a request to execute a read operation on display data related to a first frame, partitioning the read operation into a plurality of smaller read operations, selecting a first smaller read operation to execute, partitioning the first smaller read operation into a plurality of block read operations, and selecting a first block read operation to execute. The method also includes the steps of translating a display address associated with the first block read operation into a physical address associated with a first display data buffer, determining whether a state bit corresponding to the first display data buffer is set, and reading display data related to the first block read operation from either the local memory or the main memory based on whether the state bit is set.
Yet another embodiment of the present invention sets forth a method for reading cursor data from the local memory coupled to the graphics processing unit or from the main memory. The method includes the steps of receiving a request to execute a read operation on cursor data related to a first frame, partitioning the read operation into a plurality of smaller read operations, selecting a first smaller read operation to execute, determining whether a state bit corresponding to a cursor data buffer is set, and reading cursor data related to the first smaller read operation either from the local memory or from the main memory based on whether the state bit is set.
One advantage of the present invention is that it enables display data to be compressed and stored and cursor data to be optionally compressed and stored in a memory that is local to a graphics processing unit to reduce the power consumed by a mobile computing device when performing a screen refresh operation. Compressing the display data and optionally the cursor data also reduces the relative cost of the invention by reducing the size of the local memory relative to the size that would be necessary if the data were stored locally in uncompressed form. Thus, the invention may improve mobile computing device battery life, while keeping additional costs low.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
Typical mobile computing device users spend much of their time running office applications, such as word processing or spreadsheet programs. These tasks are characterized by long periods of user and display inactivity that are occasionally interrupted by keyboard or mouse input, which causes the mobile computing device to update the display accordingly. During periods of GPU inactivity, the graphics adapter rereads the same display data from main memory many times, creating identical successive frames for display. As previously described herein, each display data read operation may involve waking up the HT bus and the memory controller, reading the corresponding data from main memory, and performing one or more HT bus transactions, consuming an undesirable amount of battery power.
Efficiencies may be realized by storing a copy of current cursor data and display data in a memory that is local to the graphics adapter, thereby eliminating the need to fetch display data from main memory between mouse inputs, keyboard inputs or display updates when the data does not change from frame to frame. Further efficiencies may be realized by partitioning the display into one or more blocks per display line and partitioning the local memory into a corresponding number of buffers whose data is updated only when the relevant blocks of display data change in main memory. Still further efficiencies may be realized by compressing the display data stored in local memory to allow a smaller local memory to be used, thereby reducing the cost of implementing the local memory. However, cursor data is usually stored in uncompressed form since the relatively small amount of data required to store the cursor (e.g., 16KB) does not justify the complexity of compressing this data. Overall, these features may substantially reduce the power consumed in the mobile computing device relative to prior art solutions, while maintaining high graphics performance and minimizing the cost of storing cursor and display data locally.
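By way of illustration only, the relationship between display blocks, local memory buffers and state bits may be sketched arithmetically. The function names and the 768-line display used in the example are assumptions of this sketch; the choice of one, two or four blocks per line follows the block sizes described for the block compression embodiment.

```python
def display_buffer_count(display_lines, blocks_per_line):
    """Number of display data buffers needed when each line is split into blocks."""
    if blocks_per_line not in (1, 2, 4):
        raise ValueError("expected one, two or four blocks per line")
    return display_lines * blocks_per_line


def state_bit_count(display_lines, blocks_per_line):
    """One state bit per display data buffer plus one for the single cursor buffer."""
    return display_buffer_count(display_lines, blocks_per_line) + 1


# Hypothetical 768-line display partitioned into quarter-line blocks.
buffers = display_buffer_count(768, 4)
bits = state_bit_count(768, 4)
```

Smaller blocks mean that a change to one part of a line invalidates less locally stored data, at the cost of more buffers and state bits to track.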
In one embodiment of the invention, the local memory 220 may be an embedded dynamic random access memory (“eDRAM”). In other embodiments of the invention, the local memory 220 may be any technically feasible type of memory, including any type of RAM located either internally or externally to the GPU 200, without departing from the scope of the invention.
The GPU 200 may compress display data and store cursor and display data in the local memory 220 to reduce power during screen refresh by first configuring itself to use the local memory for cursor data and display data storage when the data stored in main memory has not changed, as described below in
As described in greater detail herein, cursor data and display data are read from main memory until the value in the threshold counter 209, which counts the number of consecutive unchanged frames, equals the value in the threshold limit register 211, which is set by a software driver, such as software driver 140, and represents the number of consecutive unchanged frames to wait before storing cursor data and compressed display data in the local memory 220. Importantly, when the cursor and display data are being read from the local memory 220, any changes to the main memory versions of the data cause snoop logic 216 to invalidate the corresponding versions of the data in the local memory 220. If snoop logic 216, which monitors the HT bus 108 for any write operations to cursor data or display data addresses in main memory, detects that either the cursor data or display data in main memory has changed, then snoop logic 216 invalidates the buffer in the local memory 220 corresponding to the changed data by resetting the state bit for that local memory buffer in the state bit memory 218 through the interface 240. As a result of the reset state bit, during creation of the next frame, control logic 207 reads the updated data in main memory rather than the invalid data in the local memory 220. Thus, the GPU 200 always uses the most current cursor data and display data for screen refresh.
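By way of illustration only, the invalidation performed by snoop logic 216 may be sketched as follows. The class name `SnoopLogic`, the `on_write` method and the representation of each local buffer as a half-open main memory address range are assumptions of this sketch.

```python
class SnoopLogic:
    """Hypothetical model: resets the state bit of any buffer whose source data is written."""

    def __init__(self, buffer_ranges):
        # buffer_ranges[i] is the (start, end) main-memory range mirrored by local buffer i.
        self.buffer_ranges = buffer_ranges
        self.state_bits = [False] * len(buffer_ranges)

    def on_write(self, address, length=1):
        # Called for every write observed on the bus.
        for index, (start, end) in enumerate(self.buffer_ranges):
            overlaps = address < end and address + length > start
            if overlaps:
                self.state_bits[index] = False  # local copy is now invalid


snoop = SnoopLogic([(0, 100), (100, 200), (200, 300)])
snoop.state_bits = [True, True, True]  # pretend all buffers hold valid data
snoop.on_write(150, 4)                 # write lands inside the second range only
bits_after_write = list(snoop.state_bits)
```

Because only the overlapping buffer is invalidated, the next frame needs to fetch from main memory only the data that actually changed.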
As shown, the method 300 for configuring the GPU 200 begins at a step 302, where the size of the display data blocks is configured by a software driver program. In one embodiment of the invention, referred to as “block compression,” the display may be partitioned into blocks of three alternative sizes: one block per frame line, one block per half frame line, or one block per quarter frame line (see, e.g.,
In step 304, the software driver 140 stores a predefined value in the threshold limit register 211. As previously described, the value of the threshold limit register 211 determines how many consecutive unchanged frames occur, as measured by the threshold counter 209, before cursor data and compressed display data are stored in the local memory 220. As long as the value of the threshold counter 209 is less than the value in the threshold limit register 211, any display data changes in main memory cause control logic 207 to clear the threshold counter 209. For example, if the GPU 200 is configured to start compression after ten consecutive unchanged frames, the software driver 140 stores the value ten in the threshold limit register 211, and cursor data and display data are read from main memory until ten consecutive display updates are performed without a display data change. However, if the display data in main memory changes after five consecutive display updates without a display data change, then the threshold counter 209 is reset from five to zero by control logic 207, and control logic 207 continues to read cursor data and display data from main memory. Starting display compression after a predefined number of consecutive unchanged frames reduces power consumption in situations where the display changes frequently since compressing and storing display data locally that may be quickly invalidated is quite inefficient.
In step 306, control logic 207 clears all state bits in the state bit memory 218. As described herein, when a state bit is clear, control logic 207 reads the cursor data buffer or display data buffer corresponding to that state bit from main memory rather than from the local memory 220 during frame creation. Only after one or more state bits are set is data read from the corresponding data buffers in the local memory 220. In step 308, control logic 207 configures itself to read cursor data and display data from main memory. In step 310, control logic 207 clears the threshold counter 209. In step 312, control logic 207 executes an operation to read uncompressed cursor data from the main memory and an operation to read uncompressed display data from the main memory to create a new frame for display. When reading data from only main memory, the GPU 200 operates in a manner that generally follows the description set forth in
In step 316, control logic 207 determines whether the new frame created in step 314 differs from the previous frame created. If the new frame does not differ from the previous frame, then the method proceeds to step 318, where control logic 207 increments the threshold counter 209. In step 320, control logic 207 determines whether the value of the threshold counter 209 equals the value stored in the threshold limit register 211. If the value of the threshold counter 209 equals the value stored in the threshold limit register 211, the method proceeds to step 322, where control logic 207 configures itself to preferentially read from the local memory 220, although control logic 207 may also read from main memory. Importantly, cursor data is stored either in the local memory 220 or in the main memory, but not both simultaneously, whereas display data may be stored in main memory, in the local memory 220, or in both. Again, by configuring itself to read cursor data and display data from both the local memory 220 and main memory, control logic 207 enables cursor data and compressed display data to be advantageously stored in local memory.
In step 324, control logic 207 executes an operation to read the cursor data needed to create a new frame for display as well as an operation to read the display data needed to create the new frame. In contrast to step 312, the cursor data and the display data may be preferentially read from the local memory 220 or read from the main memory, as the case may be, depending on whether the state bits for the relevant data buffers in the local memory 220 are set.
Returning now to step 320, if the value of the threshold counter 209 does not equal the value stored in the threshold limit register 211, then the method returns to step 312, where control logic 207 reads the cursor data and display data for creating the next frame from main memory. Returning now to step 316, if the new frame created in step 314 differs from the previous frame created, the method returns to step 310, where the threshold counter 209 is cleared.
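By way of illustration only, the loop formed by steps 310 through 322 may be sketched as follows. The function name `frames_until_local_mode` and the representation of the frame comparison of step 316 as a list of booleans are assumptions of this sketch, which models only the point at which control logic 207 switches to preferring the local memory 220.

```python
def frames_until_local_mode(frame_changed, threshold_limit):
    """Return the index of the frame at which local memory becomes preferred, or None.

    frame_changed[i] is True when frame i differs from the previously created frame.
    """
    counter = 0                          # step 310: clear the threshold counter
    for index, changed in enumerate(frame_changed):
        # steps 312-316: read from main memory, create the frame, compare
        if changed:
            counter = 0                  # back to step 310
            continue
        counter += 1                     # step 318
        if counter == threshold_limit:   # step 320
            return index                 # step 322: prefer the local memory
    return None


# Five unchanged frames, one change, then ten unchanged frames, with a limit of ten.
history = [False] * 5 + [True] + [False] * 10
switch_index = frames_until_local_mode(history, 10)
```

In this example the change after five unchanged frames resets the count, so the switch occurs only once ten further consecutive unchanged frames have been created.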
As shown, the method for reading display data begins at a step 402, where display logic 206 requests through the interface 227 for arbitration logic 208 to read all screen addresses, defined by line and pixel coordinates, from memory. Again, the display data requested may be stored in the local memory 220 and/or the main memory. In step 406, arbitration logic 208 prioritizes the read operation. Read operations related to a display update have a fixed time constraint, so arbitration logic 208 assigns a high priority to these types of read operations, while read or write operations for other purposes may be assigned a lower priority. In step 408, arbitration logic 208 initiates the high priority read operation by transmitting the read operation through the interface 234 to primary unrolling logic 210.
In step 410, primary unrolling logic 210 partitions (or “unrolls”) the read operation into a series of smaller (e.g., 32B) read operations that are small enough for the HT bus to perform as single bus transactions. After unrolling the full read operation into smaller read operations, in step 412, primary unrolling logic 210 selects a first smaller read operation to process as the current smaller read operation. In step 414, the current smaller read operation is processed, as described in further detail in
As shown, the method for executing a smaller read operation begins at step 502, where primary unrolling logic 210 transmits the smaller read operation to block unrolling logic 212 through interface 228. In step 504, block unrolling logic 212 partitions the smaller read operation, as needed, into block read operations, such that each resulting block read operation is limited to reading pixels located within a single display block. In step 506, block unrolling logic 212 selects a first block read operation from the series of block read operations to process as the current block read operation.
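By way of illustration only, the partitioning of step 504 may be sketched as follows. The function name `unroll_into_block_reads`, the use of pixel positions within a single display line, and the `(block_index, first_pixel, pixel_count)` result format are assumptions of this sketch.

```python
def unroll_into_block_reads(start_pixel, num_pixels, block_width):
    """Split a pixel range so that no resulting read crosses a display block boundary.

    Returns a list of (block_index, first_pixel, pixel_count) tuples.
    """
    block_reads = []
    pixel = start_pixel
    end = start_pixel + num_pixels
    while pixel < end:
        block_index = pixel // block_width
        block_end = (block_index + 1) * block_width  # first pixel of the next block
        count = min(end, block_end) - pixel
        block_reads.append((block_index, pixel, count))
        pixel += count
    return block_reads


# A 16-pixel read that straddles the boundary between blocks 0 and 1 (256 pixels each).
block_reads = unroll_into_block_reads(250, 16, 256)
```

A smaller read operation that already lies within one display block passes through as a single block read operation, which is why the partitioning is performed only as needed.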
In step 508, block unrolling logic 212 transmits the current block read operation to tiling logic 214 through interface 232. In step 510, tiling logic 214 determines the physical address of the block read operation from the screen address of the display block associated with the block read operation. Importantly, the physical address of the block read operation corresponds to the starting address of a display data buffer in either the local memory 220 or main memory where display data for the display block associated with the block read operation is stored. In step 512, control logic 207 determines which state bit in the state bit memory 218 corresponds to the display data buffer identified in step 510. In step 514, control logic 207 reads the state bit identified in step 512 and, in step 516, determines whether the state bit is set. If the state bit is not set, then the display data stored in the display data buffer in the local memory 220 identified in step 510 is either not present or is invalid. The method then proceeds to step 518, where tiling logic 214 transmits the block read operation to the FPCI 204, through the interface 250, in preparation for reading the display data from main memory. In step 520, the FPCI 204 requests the display data from main memory by transmitting the block read operation to the HT bus 108, and, in step 522, the FPCI 204 receives the display data requested in step 520.
In step 524, control logic 207 creates a compressed form of the display data without disturbing the uncompressed display data originally received by the FPCI 204. In step 526, control logic 207 determines whether the size of the compressed display data is greater than the capacity of the display data buffer in the local memory 220 identified in step 510. If the size of the compressed display data does not exceed the capacity of that display data buffer, then the method proceeds to step 528, where control logic 207 stores the compressed display data in the display data buffer in the local memory 220 identified in step 510. In step 530, control logic 207 sets the state bit in the state bit memory 218 corresponding to that display data buffer, and the method proceeds to step 534.
In step 534, block unrolling logic 212 determines whether the current block read operation is the last block read operation in the series of block read operations generated in step 504. If the current block read operation is not the last block read operation, then the method proceeds to step 536, where block unrolling logic 212 selects the next block read operation in the series of block read operations. The method then returns to step 508, where that next block read operation is transmitted to the tiling logic 214 for processing. If, in step 534, block unrolling logic 212 determines that the current block read operation is the last block read operation in the series of block read operations, then the smaller read operation has been fully processed, and the method terminates in step 538.
Returning now to step 526, if the size of the compressed display data is greater than the capacity of the display data buffer in the local memory 220 identified in step 510, then the compressed display data cannot be stored in the local memory 220, and the method simply proceeds to step 534. Returning now to step 516, if the state bit read in step 514 is set, then the display data in the display data buffer in the local memory 220 identified in step 510 is present and valid. The method then proceeds to step 532, where control logic 207 reads the display data from that display data buffer into reorder logic 222 through the interface 244. The method then proceeds to step 534.
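By way of illustration only, the handling of one block read operation in steps 516 through 532 may be sketched as follows. The class name `LocalDisplayCache`, the callback `read_main_memory`, the use of `zlib` as a stand-in for the unspecified compression scheme, and the decompression of locally stored data before it is returned are all assumptions of this sketch.

```python
import zlib


class LocalDisplayCache:
    """Hypothetical model of the display data buffers and their state bits."""

    def __init__(self, num_buffers, buffer_capacity):
        self.buffers = [b""] * num_buffers
        self.state_bits = [False] * num_buffers
        self.buffer_capacity = buffer_capacity  # bytes available per buffer

    def read_block(self, index, read_main_memory):
        if self.state_bits[index]:                       # step 516: bit is set
            return zlib.decompress(self.buffers[index])  # step 532: read local copy
        data = read_main_memory(index)                   # steps 518-522
        compressed = zlib.compress(data)                 # step 524: original kept intact
        if len(compressed) <= self.buffer_capacity:      # step 526
            self.buffers[index] = compressed             # step 528
            self.state_bits[index] = True                # step 530
        return data                                      # uncompressed data is used either way


main_memory_reads = []


def read_main_memory(index):
    main_memory_reads.append(index)
    # Block 0 is highly compressible; block 1 is not.
    return bytes(256) if index == 0 else bytes(range(256))


cache = LocalDisplayCache(num_buffers=2, buffer_capacity=64)
for _ in range(2):
    cache.read_block(0, read_main_memory)
    cache.read_block(1, read_main_memory)
```

In this example the compressible block is fetched from main memory once and served locally thereafter, whereas the block whose compressed form exceeds the buffer capacity is fetched from main memory on every read.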
As shown, the method for reading cursor data begins at a step 602, where display logic 206 requests through the interface 227 for arbitration logic 208 to read all cursor data from memory. Again, the data requested may be stored in the local memory 220 or the main memory. Importantly, unlike display data, which, in one embodiment, is stored within a plurality of display data buffers in the local memory 220, cursor data is stored within a single cursor data buffer in the local memory 220. Thus, all of the cursor data in the cursor buffer 225 in the local memory 220 is either present and valid, or not present or invalid. In step 606, arbitration logic 208 prioritizes the read operation. As previously described, read operations related to a display update have a fixed time constraint, so arbitration logic 208 assigns a high priority to these read operations, while read or write operations for other purposes may be assigned a lower priority. In step 608, arbitration logic 208 initiates the high priority read operation by transmitting the read operation through the interface 234 to primary unrolling logic 210.
In step 610, primary unrolling logic 210 partitions the read operation into a series of smaller read operations that are small enough for the HT bus to perform as single bus transactions. After unrolling the read operation into smaller read operations in step 610, primary unrolling logic 210 selects a first smaller read operation to process as the current smaller read operation. In step 614, primary unrolling logic 210 transmits the current smaller read operation to tiling logic 214 through interface 230. Unlike display data block read operations, which have a screen-to-physical address translation step within tiling logic 214, cursor data smaller read operations do not need an address translation step because each cursor data smaller read operation is requested with a physical address. In step 616, control logic 207 reads the cursor state bit from the state bit memory 218 and, in step 622, determines if the cursor state bit is set. If the cursor state bit is not set, any cursor data stored in the cursor buffer 225 in the local memory 220 is either not present or invalid. The method then proceeds to step 624, where tiling logic 214 transmits the smaller read operation to the FPCI 204, through the interface 250, as a first step in reading from main memory. In step 626, the FPCI 204 requests the cursor data from main memory by transmitting the smaller read operation to the HT bus 108. In step 627, the FPCI 204 receives the cursor data requested in step 626. In step 628, control logic 207 stores the cursor data in the cursor buffer 225 in the local memory 220. In step 630, control logic 207 sets the cursor state bit, and the method proceeds to step 634.
In step 634, primary unrolling logic 210 determines whether the current smaller read operation is the last smaller read operation in the series of smaller read operations generated in step 610. If the current smaller read operation is not the last smaller read operation, the method proceeds to step 636, where primary unrolling logic 210 selects the next smaller read operation in the series of smaller read operations. The method then returns to step 614, where that next smaller read operation is transmitted to the tiling logic 214 for processing. If, in step 634, the current smaller read operation is the last smaller read operation in the series of smaller read operations, then the method proceeds to step 638 and terminates.
Returning now to step 622, if the cursor state bit read in step 616 is set, then the cursor data in the cursor buffer 225 in the local memory 220 is present and valid. In step 632, control logic 207 reads the cursor data from the cursor buffer 225, and the method then proceeds to step 634.
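By way of illustration only, the cursor path may be sketched as follows. The class name `CursorCache` and the callback `read_main_memory` are assumptions of this sketch, which is also simplified to treat the cursor data as one unit rather than as a series of smaller read operations. In contrast to the display data path, there is a single buffer, a single state bit, no address translation and no compression.

```python
class CursorCache:
    """Hypothetical model of the single cursor buffer and its state bit."""

    def __init__(self):
        self.cursor_buffer = b""
        self.state_bit = False

    def read_cursor(self, read_main_memory):
        if self.state_bit:               # step 622: bit is set
            return self.cursor_buffer    # step 632: read the local copy
        data = read_main_memory()        # steps 624-627
        self.cursor_buffer = data        # step 628: stored uncompressed
        self.state_bit = True            # step 630
        return data

    def invalidate(self):
        # Models a reset of the cursor state bit after a write to the cursor data.
        self.state_bit = False


fetches = []


def read_main_memory():
    fetches.append("cursor")
    return b"\x01\x02\x03\x04"


cursor = CursorCache()
cursor.read_cursor(read_main_memory)   # fetched from main memory
cursor.read_cursor(read_main_memory)   # served from the cursor buffer
cursor.invalidate()
cursor.read_cursor(read_main_memory)   # fetched from main memory again
```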
In an alternative embodiment of the invention, referred to as “frame compression,” the GPU 200 may be configured to store some or all of an entire frame as a single display block. This single display block is compressed and stored in a single display data buffer in the portion of the local memory 220 where the compressed display data 226 is stored. Cursor data is stored in the cursor buffer 225 within the local memory 220 as well. Thus, referring back to
One advantage of the disclosed technique is that the power consumed by mobile computing devices may be substantially reduced by refreshing the screen using cursor data and display data stored in local memory. Another advantage of the disclosed technique is that the cost of implementing the local memory is lowered by compressing the display data before storing it in the local memory, relative to storing uncompressed display data.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the present invention is determined by the claims that follow.
This application is a continuation of co-pending U.S. patent application titled, “Screen Compression For Mobile Applications,” filed on Sep. 21, 2006 and having Ser. No. 11/534,043 (Attorney Docket Number NVDA/P002646). This related application is also hereby incorporated by reference in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 11534043 | Sep 2006 | US |
| Child | 13050708 | | US |