The present invention relates generally to the field of computer graphics rendering, and more particularly, ways of and means for improving the performance of rendering processes supported on GPU-based 3D graphics platforms associated with diverse types of computing machinery.
Real-time 3D graphics applications such as video games have two contradictory needs. On the one hand there is the requirement for high photorealism; on the other hand a high frame rate is desired. In the video game industry the trend is to push the frame rate ever higher. However, when the frame rate overtakes the screen refresh rate (typically 60 Hz), a tearing artifact occurs, badly affecting the image quality. The higher the frame rate, the worse the tearing effect. Tearing occurs whenever the frame feed is not synchronized with the screen refresh rate, so it may also occur when the FPS is below the screen refresh rate. However, it is statistically more likely to be seen at higher FPS.
Tearing is a visual artifact in video or 3D rendered frames (typically in, but not limited to, 3D games) where information from two or more different frames is shown in a display device simultaneously in a single screen draw.
Tearing can occur with most common display technologies and video cards, and is most noticeable in situations where horizontally moving visuals are common.
The method of prior art v-sync is illustrated in
Video games, which have a wide variety of rendering engines, tend to benefit well from vertical synchronization, as the rendering engine is normally expected to build each frame in real time, based on whatever the engine's variables specify at the moment a frame is requested. However, because vertical synchronization causes input lag, it can interfere with games which require precise timing or fast reaction times. 3D CAD applications benefit as well from vertical synchronization. These applications are known for their slower frame rate due to large amounts of data. Their tearing effect is typically caused by the screen refresh mechanism, unsynchronized with the slower displayed frames.
A graphics system without v-sync has the best responsiveness, as demonstrated in
Therefore, the v-sync of prior art solves the tearing artifacts, however it suffers from two major drawbacks: (i) performance penalties binding FPS to the screen refresh rate, and (ii) input lag that reduces the application's responsiveness. These two shortfalls are critical in real-time graphics applications.
Vertical synchronization (v-sync) in prior art prevents video tearing artifacts by keeping the frame rate of the rendering engine equal to the monitor's refresh rate, if the frame rate originally tends to be lower or higher. However, this technique suffers from two substantial shortcomings: performance limitation and input lag, both of which are critical drawbacks in real-time applications such as video games.
The virtual vertical-synchronization (Virtual V-sync) of the present invention removes the performance shortfall by virtually allowing any frame-per-second rate, independent of the monitor refresh rate, and eliminates the input lag by removing frame blocking. The method is based on preventing excessive application-generated frames from being displayed; instead, the unpresented frames are dropped, or shortened first and then dropped. In order to eliminate artifacts caused by missing frames, inter-frame dependency is resolved.
The virtual vertical-synchronization method of some embodiments of the invention can work with any off-the-shelf GPU and computing system, independently of GPU make, model or size. The virtual vertical-synchronization of the present invention is the basis for two additional aspects of the invention: power consumption control of graphics systems and improved GPU utilization in cloud-based real-time graphics applications, such as cloud gaming.
There is provided, in accordance with an embodiment of the present invention, a method for rendering frames in graphic systems including not displaying at least one frame in a sequence of frames.
According to an embodiment of the present invention, the method further includes determining an amount of time required to finish displaying a rendered frame being currently displayed in the sequence of frames.
According to an embodiment of the present invention, the method further includes determining an amount of time required to render the at least one frame.
According to an embodiment of the present invention, the method further includes not displaying the at least one frame when a time difference between the amount of time required to render the at least one frame and the amount of time required to finish displaying the rendered frame being currently displayed exceeds a predetermined time.
According to an embodiment of the present invention, the method further includes evaluating inter-frame dependency between the at least one frame and a successive one or more frames in the sequence of frames.
According to an embodiment of the present invention, the method further includes shortening the at least one frame if inter-frame dependency exists with the successive one or more frames.
According to an embodiment of the present invention, the shortening includes removing from the at least one frame some rendering commands.
According to an embodiment of the present invention, the method further includes shortening the at least one frame prior to the not displaying.
According to an embodiment of the present invention, the not displaying includes removing from the at least one frame all rendering commands.
According to an embodiment of the present invention, the not displaying includes discarding the at least one frame following its rendering.
There is provided, in accordance with an embodiment of the present invention, a computing device including a CPU (central processing unit) to manage rendering of frames associated with graphics context and to issue an instruction to not display at least one frame in a sequence of frames; and a GPU (graphics processing unit) to render frames in the sequence of frames.
According to an embodiment of the present invention, the CPU determines an amount of time required to finish displaying a rendered frame being currently displayed.
According to an embodiment of the present invention, the CPU further determines an amount of time required to render the at least one frame.
According to an embodiment of the present invention, the CPU issues the instruction to not display responsive to a time difference between the amount of time required to render the at least one frame and the amount of time required to finish displaying the rendered frame being currently displayed exceeding a predetermined time.
According to an embodiment of the present invention, the CPU evaluates inter-frame dependency between the at least one frame and one or more successive frames in said sequence of frames.
According to an embodiment of the present invention, the CPU shortens the at least one frame if inter-frame dependency exists with the one or more successive frames.
According to an embodiment of the present invention, the shortening includes the CPU removing from the at least one frame some rendering commands.
According to an embodiment of the present invention, the CPU shortens the at least one frame prior to issuing the instruction to not display.
According to an embodiment of the present invention, the instruction to not display includes the CPU removing from the at least one frame all rendering commands.
According to an embodiment of the present invention, the instruction to not display includes the CPU discarding the at least one frame following its rendering by the GPU.
For a more complete understanding of practical applications of the embodiments of the present invention, the following detailed description of the illustrative embodiments can be read in conjunction with the accompanying drawings, briefly described below:
b. Prior art. The effect of tearing.
c. Prior art. The method of v-sync.
d. Prior art. Responsiveness of graphics system without v-sync mechanism.
e. Prior art. Deteriorated responsiveness to user input, due to frame blocking by the application.
a. Flowchart of the ‘basic’ mode of Virtual Vsync.
b. Frame sequence of Virtual Vsync.
a. Flowchart of the ‘hybrid’ mode of Virtual Vsync.
b. Frame sequence of the hybrid mode of the Virtual Vsync.
a. Flowchart of the ‘concise’ mode of Virtual Vsync.
b. The ‘concise’ mode of Virtual Vsync.
c. The principle of inter-frame dependency.
d. Various cases of inter-frame dependency.
a. Responsiveness of the basic mode of the Virtual Vsync method.
b. Comparison chart of responsiveness: prior art's standard v-sync vs. one embodiment of present invention's Virtual Vsync.
a. Implementation on a discrete GPU system.
b. Implementation on a dual GPU system.
The virtual vertical-synchronization (Virtual V-sync) of the different embodiments of the present invention removes performance shortfalls by allowing virtually any frame-per-second rate, independent of the monitor refresh rate, and eliminates the input lag by removing the frame blocking mechanism. The term “monitor refresh rate” refers to the number of times per second that the display hardware draws the data. This is distinct from the “application frame rate”, which measures how often the application driving the graphics system can feed an entire frame of new data to a display. When the application frame rate is higher than the refresh rate, the “actual frame rate” of the graphics system is the monitor refresh rate. In embodiments of the present invention the excess application frames, above the refresh rate, are designated as “to-be-dropped” frames. These frames are dropped without rendering, or rendered only partly in case of inter-frame dependency, as explained hereinafter. “Frame blocking” refers to keeping the rendered frame on hold until the display hardware completes displaying the previous frame. Frame blocking causes input lag, deteriorating graphics system responsiveness.
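The relationship between the three rates defined above can be illustrated with a short sketch. The function names are hypothetical, and the behavior for an application frame rate below the refresh rate (no frames designated to-be-dropped) is an assumption rather than something stated in the description.

```python
def actual_frame_rate(app_fps: float, refresh_hz: float) -> float:
    """Rate at which frames reach the display.

    Above the refresh rate the actual rate is the refresh rate (per the text);
    below it we assume the application rate is the limit.
    """
    return min(app_fps, refresh_hz)


def to_be_dropped_per_second(app_fps: float, refresh_hz: float) -> float:
    """Application frames per second in excess of the refresh rate."""
    return max(0.0, app_fps - refresh_hz)


# Example: an application producing 150 FPS on a 60 Hz monitor.
print(actual_frame_rate(150.0, 60.0))         # 60.0
print(to_be_dropped_per_second(150.0, 60.0))  # 90.0
```

Under these assumptions, 90 of every 150 application frames would be candidates for dropping or shortening.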
There are three embodiments of the present invention: (i) the basic mode, in which the subsequent frame is generated by the application and then, at the time of display, is either displayed or dropped, depending on screen availability; (ii) the hybrid mode, in which the subsequent frame is generated, but its display depends on the time remaining for the currently displayed frame; and (iii) the concise-frame mode, in which the time remaining for the currently displayed frame is assessed in advance, and the immediate drop of a fully generated frame is replaced by creating a concise frame with a reduced number of draw calls, which is then dropped. In the following description the term BF refers to the successively generated back buffer frames, and FF stands for front buffer frames displayed at a restricted refresh rate.
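The timing test on which the hybrid and concise-frame modes rely (comparing the time needed to render a frame with the time the current frame still needs on screen) might be sketched as follows. The function name, the millisecond units and, in particular, the sign convention of the time difference are assumptions made for illustration; the description only states that the frame is not displayed when the difference exceeds a predetermined time.

```python
def should_skip_display(render_ms: float,
                        remaining_display_ms: float,
                        threshold_ms: float) -> bool:
    """Return True if the frame should be treated as a to-be-dropped frame.

    Assumption: the 'time difference' is the slack between the moment the
    frame would be ready and the moment the screen becomes available. A large
    slack means the frame would have to be held (blocked), so it is skipped.
    """
    slack_ms = remaining_display_ms - render_ms
    return slack_ms > threshold_ms


# The current frame still needs 12 ms on screen; the threshold is 2 ms.
print(should_skip_display(render_ms=4.0, remaining_display_ms=12.0, threshold_ms=2.0))   # True
print(should_skip_display(render_ms=11.0, remaining_display_ms=12.0, threshold_ms=2.0))  # False
```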
a shows a flowchart of the Virtual V-sync basic mode. In this mode all the BF are unconditionally generated, as if all were going to be displayed. A frame, when completed, is either sent to display or dropped without being displayed, depending on screen availability. No frames are blocked by the application. Tearing is eliminated because the undropped frames are never presented to the display in the middle of the current FF; they always start a new screen scan upon termination of the previous one, beginning from the starting point on the screen. Consequently, the FPS performance is high, at the level of a non-v-sync unlimited frame rate, but without the tearing artifacts. This is clearly illustrated in
The hybrid mode, based on controlled frame blocking, allows higher FPS than the prior art v-sync. It is flowcharted in
The concise frame mode is based on either shortening the undisplayed frames before dropping them or dropping them without shortening, allowing higher FPS. The screen availability upon completion of BF is assessed in advance. As shown in the flowchart of
A frame becomes subject to inter-frame dependency if a graphics entity (e.g. texture resource) created as part of the frame, by means of a render target task (herein termed shortly as ‘task’), evoked by a draw call, becomes a source of reference to successive frames. Inter-frame dependency is illustrated in
A frame can be seen as a flow of tasks (T1-T2-T3- . . . TN), where each task has its input and output resources: vertex buffer (VB), index buffer (IB), texture, render target (RT), shaders, and states. Inter-frame dependency arises when, for example, an output B of task Tk at frame N is used as an input to task T1 of frame N+1. If that input B is missing, the result is an artifact. For that reason, at the time of formation, inter-frame dependency between tasks must be identified and resolved in order to prevent artifacts.
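The task-flow view above lends itself to a simple illustration. The following sketch models a frame as a list of tasks with named input and output resources and reports which resources of frame N are consumed by frame N+1. The `Task` class and the function name are hypothetical, and the sketch makes a simplifying assumption: any write to a resource in frame N+1 is treated as a complete overwrite, which is not always true in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    inputs: frozenset = frozenset()   # resources the task reads
    outputs: frozenset = frozenset()  # resources the task writes


def inter_frame_dependencies(frame_n, frame_next):
    """Resources written in frame_n and read in frame_next before being rewritten there."""
    produced = set()
    for task in frame_n:
        produced |= task.outputs

    rewritten = set()      # resources frame_next has already regenerated
    dependencies = set()
    for task in frame_next:
        dependencies |= (task.inputs & produced) - rewritten
        rewritten |= task.outputs
    return dependencies


# Output B of task Tk in frame N feeds task T1 of frame N+1.
frame_n = [
    Task("T1", outputs=frozenset({"A"})),
    Task("Tk", inputs=frozenset({"A"}), outputs=frozenset({"B"})),
]
frame_n1 = [Task("T1", inputs=frozenset({"B"}), outputs=frozenset({"C"}))]
print(inter_frame_dependencies(frame_n, frame_n1))  # {'B'}
```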
Practically speaking, there are two different methods to deal with the inter-frame dependency issue. The simple one is a “per application” method based on an individual investigation of each application, making a list of all resources that ought to be provided by one of the preceding frames. The tasks that generate those resources shouldn't be dropped. However, this is a customization method; it is manual and expensive. It requires a human learning curve for each application. Consequently, an automatic method for solving inter-frame dependency is needed.
In one embodiment of the present invention the automatic method for solving inter-frame dependency is based on a Dependency Handler software module, responsible for preventing artifacts caused by frame dependency. For every resource, the module must identify the updating task. Whenever a dependency exists, it must make sure that the successive frames receive all the required resources. This is done by keeping the updating task as part of the concise frame, while other draw calls can be removed. The resource is then generated, and from this point on the resource becomes available to all successive frames.
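A minimal sketch of how such a handler could build a concise frame is given below. It is not the module's actual implementation: the `DrawCall` class, the resource names and the function name are hypothetical, and the sketch keeps only the draw calls that directly update a needed resource, without tracing what those draw calls themselves read within the same frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawCall:
    name: str
    outputs: frozenset = frozenset()  # resources updated by this draw call


def make_concise_frame(frame, needed_later):
    """Keep only the draw calls that update a resource required by successive frames."""
    return [call for call in frame if call.outputs & needed_later]


frame = [
    DrawCall("shadow_map", frozenset({"shadow_tex"})),
    DrawCall("scene", frozenset({"backbuffer"})),
    DrawCall("hud", frozenset({"backbuffer"})),
]

# Only 'shadow_tex' is referenced by the next frame, so one draw call survives.
concise = make_concise_frame(frame, {"shadow_tex"})
print([call.name for call in concise])  # ['shadow_map']
```

When no resource is needed by later frames, the concise frame is empty, which corresponds to dropping the frame without rendering.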
d shows different cases of inter-frame dependency. Resources of successive frames are shown. A resource in frame 1 is set by the command Set Render Target. In frame 2 this resource is called up by the command Set Texture. It is essential to verify in frame 2 whether the called up resource is dependent on the previous frame or not. Case 1 is a simple example of dependency when the final result in frame 2 depends on the drawn element in the preceding frame. In case 1 a small rectangle was created by the first frame. In the next frame the dependency disappears only if the rectangle is completely overdrawn. In case 1 the original rectangle appears in the final result as well, which makes the second frame dependent on the first one. In case 2 the triangle overwrites the rectangle, removing the dependency.
The difficulty stems from the need to recognize in real time whether the overwriting was complete or not. In case 3 the answer is simple because the overwrite is by a full square quad, and in case 6 because of a Clear command; either removes any chance of dependency. In case 4 the full quad is assembled from a puzzle of smaller polygons, which raises uncertainty. If the polygons fully cover the texture, no dependency exists. The occlusion query command, counting the number of drawn pixels, can help. However, if the texture is not completely covered, the dependency is questionable: both options still exist. Case 5 shows an example of incomplete overdraw, leaving the dependency in place. In the case of uncertainty, we need to take a “false positive” approach, meaning that we must assume dependency, in order to eliminate any chance of artifacts.
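The conservative ("false positive") rule described above can be sketched as a small decision function. The function and parameter names are hypothetical. The pixel count is assumed to come from an occlusion query and to count each covered pixel once, which holds only if the covering polygons do not overlap.

```python
from typing import Optional


def must_assume_dependency(cleared: bool,
                           full_quad_overwrite: bool,
                           covered_pixels: Optional[int] = None,
                           total_pixels: int = 0) -> bool:
    """Return True unless a complete overwrite of the resource is proven."""
    if cleared or full_quad_overwrite:
        return False  # a Clear command or a full quad removes the dependency
    if covered_pixels is not None and total_pixels > 0 and covered_pixels >= total_pixels:
        return False  # smaller polygons proven to cover the whole resource
    return True       # incomplete or unknown coverage: assume dependency


print(must_assume_dependency(cleared=False, full_quad_overwrite=True))     # False
print(must_assume_dependency(False, False, covered_pixels=900, total_pixels=1024))  # True
print(must_assume_dependency(False, False))                                # True
```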
Some embodiments of the present invention minimize input lags to the level of graphics systems without v-sync solutions. As mentioned before, input lags deteriorate the responsiveness of real-time graphics systems, interfering with games which require precise timing or fast reaction times. The high responsiveness of the Virtual v-sync method of the embodiment of the present invention is illustrated in
An additional way of improving responsiveness in some embodiments is by shortening the queue of driver-formed frames in the CPU. The frames are queued prior to being sent to the GPU. The typical queue length in a CPU is three frames, with no blocking, causing a constant input lag. This lag can be shortened by decreasing the queue to one or two driver-formed frames.
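As a rough worked example of the effect of queue length, the lag contributed by the queue can be estimated under the simplifying assumption that each queued frame adds one frame interval of delay. This model and the function name are illustrative assumptions; the description does not give a formula.

```python
def queue_lag_ms(queued_frames: int, fps: float) -> float:
    """Estimated input lag from the driver queue, assuming one frame interval per queued frame."""
    return queued_frames * 1000.0 / fps


# At 60 FPS a three-frame queue adds about 50 ms; a one-frame queue about 16.7 ms.
for depth in (3, 2, 1):
    print(depth, round(queue_lag_ms(depth, 60.0), 1))
```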
In summary, the different embodiments prevent video tearing artifacts, performance limitations and input lag in graphics systems, all of which are critical in real-time applications.
Micro stuttering is inherent in every technique of dropping frames in a non-uniform way. Typically, micro stuttering is a term used in computing to describe a quality defect inherent in multi-GPU configurations using time division (alternate frame rendering). It manifests as irregular delays between frames rendered by the multiple GPUs. This effect may be apparent when the flow of frames appears to stutter, resulting in a degraded game play experience in video games, even though the frame rate seems high enough to provide a smooth experience.
In different embodiments, when shortening and dropping of frames are practiced, micro stuttering may appear. It causes two deteriorating effects: (i) a non-fluent image (stuttering image), when the animated contents do not develop smoothly, and (ii) a non-uniform pace of displaying frames (stuttering display). The stuttering of an image stems from the discrepancy introduced into the virtual timeline of the animated application when frames are missing from the timed sequence. The virtual time must then be compensated accordingly, to eliminate image stuttering.
In summary, in order to prevent stuttering of the display as well as of the image, the application clock must be controlled by timely submissions of frames to the GPU, and timely presents to the display. The same method should be applied to mouse and keyboard movements, which should be manipulated to fit the actually presented frames in the same way as the application's clock is controlled.
Another embodiment of the present invention is suited to cloud gaming applications. Cloud gaming is a type of online gaming that allows direct and on-demand streaming of games onto a computer through the use of a thin client, in which the actual game is stored on the operator's or game company's server and is streamed directly to computers accessing the server through the client. This makes the capability of the user's computer unimportant, as the server is handling the processing needs. The controls and button presses from the user are transmitted directly to the server, where they are recorded, and the server then sends back the game's response to the input controls.
High utilization of the GPU in cloud gaming is of significant importance. The more applications a GPU can run simultaneously, the higher its utilization. It is gained by usage of the concise mode along with the solution of inter-frame dependency, as described above.
The graphics subsystem of a computing system is typically the largest power consumer. The power dissipated by a graphics subsystem is proportional to the frame rate: P=C*FPS, where P is the dissipated power and C is the heat capacitance. As FPS changes, the power follows the change in a linear way. Lowering FPS decreases the power consumption. Unfortunately, this decrease in power consumption comes at the price of degraded responsiveness, due to a slower FPS. For that reason a real-time power-performance tradeoff must be kept. The capability of controlled FPS suggests a dynamic way of doing this: the dynamic FPS scaling mechanism, whereby the FPS of a graphics subsystem can be automatically adjusted “on the fly,” either lowered to conserve power and reduce the amount of heat generated at the cost of responsiveness, or increased to improve the responsiveness. Such dynamic FPS scaling would be important in laptops, tablets and other mobile devices, where energy comes from a battery and thus is limited. It can also be used in quiet computing settings that need low noise levels, such as video editing, sound mixing, home servers, and home theater PCs. A typical quiet PC uses quiet cooling and storage devices and energy-efficient parts. Less heat output, in turn, allows the system cooling fans to be throttled down or turned off, reducing noise levels and further decreasing power consumption.
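The linear relation P=C*FPS suggests how a dynamic FPS scaling policy could pick a frame-rate cap for a given power budget. The sketch below is illustrative only: the function names, the clamping range and the numeric value of C (treated here simply as a proportionality constant in watts per frame-per-second) are assumptions, not values from the description.

```python
def power_watts(c: float, fps: float) -> float:
    """Dissipated power under the linear model P = C * FPS."""
    return c * fps


def fps_cap_for_budget(c: float, budget_watts: float,
                       min_fps: float, max_fps: float) -> float:
    """Highest FPS that stays within the power budget, clamped to [min_fps, max_fps]."""
    return max(min_fps, min(max_fps, budget_watts / c))


C = 0.5  # hypothetical constant: 0.5 W per frame-per-second
print(power_watts(C, 60.0))                      # 30.0
print(fps_cap_for_budget(C, 20.0, 30.0, 120.0))  # 40.0 (on battery: conserve power)
print(fps_cap_for_budget(C, 100.0, 30.0, 120.0)) # 120.0 (on mains: favor responsiveness)
```

The lower clamp reflects the tradeoff named in the text: below some frame rate the loss of responsiveness outweighs the power saved.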
In some embodiments of the present invention, the capability of altering the FPS is applied to controlling the power consumption of the system. In concise mode the FPS is raised by dropping some frames or cutting parts thereof. As a result, when the GPU power consumption in concise mode is compared with the GPU power consumption in native mode at a given FPS, the consumption in concise mode is seen to be lower, saving power. This is evident from the table of
Unfortunately, the total power consumption does not drop in the same ratio, because of the second largest power consumer in the system, the CPU. Following the increased FPS, the CPU needs to work harder, preparing more frames per time unit for the GPU, resulting in intensified power consumption. This is evident from
The way to retain the power saving of the GPU is to artificially reduce the power consumption of the CPU, without interfering with the CPU's work on behalf of graphics. Usually, each frame processed by the GPU has to be pre-processed by the CPU, transferred to the GPU for rendering, and finally sent from the GPU to display by the Present call. The frame rendering period at the GPU overlaps with the CPU pre-processing of the successive frame. Typically, the pre-processing time at the CPU is shorter, terminating at some time before the Present call, resulting in a CPU idle period. According to an embodiment of the present invention the CPU is shut down during that idle period, by an issued Sleep(X MS) command (also called CPU bubbles). This is shown in
The preferred embodiment of Virtual V-sync of the present invention comprises GPU-related graphics contexts, and CPU-related tasks to manage the graphics contexts. There are two graphics contexts:
The Rendering Context is managed by a series of CPU tasks: (i) deciding whether to drop or shorten frames, (ii) testing inter-frame dependencies, (iii) modifying frames accordingly, (iv) feeding the GPU with data and commands, and (v) transferring the final back buffers to presenting frames. A series of tasks is required to manage the Display Context: (i) receiving rendered frames from the rendering context, (ii) managing the back buffers swap chain, and (iii) controlling the Display Sync.
a and 10b demonstrate two preferred system embodiments of the present invention, based on off-the-shelf components, such as multicore chips, CPU and GPU fusion chips, discrete GPUs, etc.
This application is a continuation application claiming benefit from U.S. patent application Ser. No. 13/437,869 filed 2 Apr. 2012, which claimed priority from U.S. Provisional Application No. 61/471,154 filed 3 Apr. 2011 and which is hereby incorporated in its entirety by reference.
Number | Date | Country
---|---|---
61471154 | Apr 2011 | US

 | Number | Date | Country
---|---|---|---
Parent | 13437869 | Apr 2012 | US
Child | 14302439 | | US