Program sequencer for generating indeterminant length shader programs for a graphics processor

Information

  • Patent Grant
  • 8659601
  • Patent Number
    8,659,601
  • Date Filed
    Wednesday, August 15, 2007
  • Date Issued
    Tuesday, February 25, 2014
Abstract
A method for loading and executing an indeterminate length shader program. The method includes accessing a first portion of a shader program in graphics memory of a GPU and loading instructions from the first portion into a plurality of stages of the GPU to configure the GPU for program execution. A group of pixels is then processed in accordance with the instructions from the first portion. A second portion of the shader program is accessed in graphics memory of the GPU and instructions from the second portion are loaded into the plurality of stages of the GPU to configure the GPU for program execution. The group of pixels is then processed in accordance with the instructions from the second portion.
Description
CROSS REFERENCE TO RELATED APPLICATIONS

This application is related to the following U.S. patent applications: “FRAGMENT SPILL/RELOAD FOR A GRAPHICS PROCESSOR”, by Mahan et al., filed on Aug. 15, 2007, 11/893502; “SHADER PROGRAM INSTRUCTION FETCH”, by Mahan et al., filed on Aug. 15, 2007, 11/893503; and “SOFTWARE ASSISTED SHADER MERGING”, by Mahan et al., filed on Aug. 15, 2007, 11/893439.


FIELD OF THE INVENTION

The present invention is generally related to programming graphics computer systems.


BACKGROUND OF THE INVENTION

Recent advances in computer performance have enabled graphic systems to provide more realistic graphical images using personal computers, home video game computers, handheld devices, and the like. In such graphic systems, a number of procedures are executed to “render” or draw graphic primitives to the screen of the system. A “graphic primitive” is a basic component of a graphic picture, such as a point, line, polygon, or the like. Rendered images are formed with combinations of these graphic primitives. Many procedures may be utilized to perform 3-D graphics rendering.


Specialized graphics processing units (e.g., GPUs, etc.) have been developed to optimize the computations required in executing the graphics rendering procedures. The GPUs are configured for high-speed operation and typically incorporate one or more rendering pipelines. Each pipeline includes a number of hardware-based functional units that are optimized for high-speed execution of graphics instructions/data. Generally, the instructions/data are fed into the front end of the pipeline and the computed results emerge at the back end of the pipeline. The hardware-based functional units, cache memories, firmware, and the like, of the GPU are optimized to operate on the low-level graphics primitives and produce real-time rendered 3-D images.


In modern real-time 3-D graphics rendering, the functional units of the GPU need to be programmed in order to properly execute many of the more refined pixel shading techniques. These techniques require, for example, the blending of colors into a pixel in accordance with factors in a rendered scene which affect the nature of its appearance to an observer. Such factors include, for example, fogginess, reflections, light sources, and the like. In general, several graphics rendering programs (e.g., small specialized programs that are executed by the functional units of the GPU) influence a given pixel's color in a 3-D scene. Such graphics rendering programs are commonly referred to as shader programs, or simply shaders. In more modern systems, some types of shaders (e.g., vertex shaders) can be used to alter the actual geometry of a 3-D scene and other primitive attributes.


In a typical GPU architecture, each of the GPU's functional units is associated with a low level, low latency internal memory (e.g., register set, etc.) for storing instructions that program the architecture for processing the primitives. The instructions typically comprise shader programs and the like. The instructions are loaded into their intended GPU functional units by propagating them through the pipeline. As the instructions are passed through the pipeline, when they reach their intended functional unit, that functional unit will recognize its intended instructions and store them within its internal registers.


Prior to being loaded into the GPU, the instructions are typically stored in system memory. Because of the much larger size of the system memory, a large number of shader programs can be stored there. A number of different graphics processing programs (e.g., shader programs, fragment programs, etc.) can reside in system memory. The programs can each be tailored to perform a specific task or accomplish a specific result. In this manner, the graphics processing programs stored in system memory act as a library, with each of a number of shader programs configured to accomplish a different specific function. For example, depending upon the specifics of a given 3-D rendering scene, specific shader programs can be chosen from the library and loaded into the GPU to accomplish a specialized customized result.


The graphics processing programs, shader programs, and the like are transferred from system memory to the GPU through a DMA (direct memory access) operation. This allows the GPU to selectively pull in the specific programs it needs. The GPU can assemble an overall graphics processing program, shader, etc. by selecting two or more of the graphics programs in system memory and DMA transferring them into the GPU.


A problem exists, however, in that the overall length of any given shader program is limited by the hardware resources of the GPU architecture. For example, as the GPU architecture is designed, its required resources are determined by the expected workload of the applications the architecture is targeted towards. Based on the expected workload, design engineers include a certain number of registers, command table memory of a certain amount, cache memory of a certain amount, FIFOs of a certain depth, and the like. This approach leads to a built-in inflexibility. The so-called “high-end” architectures often include excessive amounts of hardware capability which goes unused in many applications. This leads to excessive cost, excessive power consumption, excessive heat, and the like. Alternatively, the so-called “midrange” or “low-end” architectures often have enough hardware capability to handle most applications, but cannot properly run the more demanding high-end applications. Unfortunately, these high-end applications often have the most compelling and most interactive user experiences, and thus tend to be popular and in high demand.


Thus, a need exists for graphics architecture that implements a shader program loading and execution process that can scale as graphics application needs require and provide added performance without incurring penalties such as increased processor overhead.


SUMMARY OF THE INVENTION

Embodiments of the present invention provide a graphics architecture that implements a shader program loading and execution process that can scale as graphics application needs require and provide added performance without incurring penalties such as increased processor overhead.


In one embodiment, the present invention is implemented as a method for loading and executing an indeterminate length shader program (e.g., normal length shader programs, long shader programs, very long shader programs, etc.). To execute shader programs of indeterminate length, embodiments of the present invention execute them in portions. The method includes accessing a first portion of the shader program in graphics memory of the GPU and loading instructions from the first portion into a plurality of stages (e.g., ALU, etc.) of the GPU to configure the GPU for program execution. A group of pixels (e.g., a group of pixels covered by one or more primitives, etc.) is then processed in accordance with the instructions from the first portion. A second portion of the shader program is then accessed in graphics memory and instructions from the second portion are loaded into the plurality of stages to configure the GPU for program execution. The group of pixels is then processed in accordance with the instructions from the second portion. Accordingly, the shader program can comprise more than two portions. In such a case, for each of the portions, the GPU can process the group of pixels by loading instructions for the portion and executing instructions for that portion, and so on until all the portions comprising the shader program are executed.


In one embodiment, the GPU stores the group of pixels in graphics memory subsequent to program execution for the first portion. Subsequent to loading instructions from the second portion, the GPU accesses the group of pixels to perform program execution for the second portion. In one embodiment, the plurality of stages of the GPU are controlled by a program sequencer unit. The program sequencer unit functions by configuring the GPU to load instructions for the shader program and execute the shader program. In one embodiment, the program sequencer unit executes a state machine that coordinates the operation of the GPU.


In one embodiment, the state machine executing on the program sequencer unit controls the plurality of stages of the GPU to implement a recirculation of the group of pixels through the plurality of stages of the GPU for a complex graphics processing operation. The recirculation can be used to implement the complex graphics processing where more than one pass through the GPU pipeline is required. In one embodiment, the state machine controls the plurality of stages of the GPU to implement a basic graphics processing operation by implementing a single pass (e.g., one pass through the pipeline) of the group of pixels through the plurality of stages. The single pass-through mode can be used to implement comparatively basic graphics processing where, for example, a shader can be executed using a single pass through the stages of the GPU.





BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the Figures of the accompanying drawings and in which like reference numerals refer to similar elements.



FIG. 1 shows a computer system in accordance with one embodiment of the present invention.



FIG. 2 shows a diagram 200 illustrating internal components of a GPU and a graphics memory in accordance with one embodiment of the present invention.



FIG. 3 shows a diagram of a system memory and a plurality of functional modules of a graphics pipeline in accordance with one embodiment of the present invention.



FIG. 4 shows a diagram of the internal components of the program sequencer in accordance with one embodiment of the present invention.



FIG. 5 shows a diagram of the internal components of a command unit in accordance with one embodiment of the present invention.



FIG. 6 shows a diagram of a state machine executing on the program sequencer in accordance with one embodiment of the present invention.





DETAILED DESCRIPTION OF THE INVENTION

Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be recognized by one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments of the present invention.


Notation and Nomenclature:


Some portions of the detailed descriptions, which follow, are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.


It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as “processing” or “accessing” or “executing” or “storing” or “rendering” or the like, refer to the action and processes of a computer system (e.g., computer system 100 of FIG. 1), or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.


Computer System Platform:



FIG. 1 shows a computer system 100 in accordance with one embodiment of the present invention. Computer system 100 depicts the components of a basic computer system in accordance with embodiments of the present invention providing the execution platform for certain hardware-based and software-based functionality. In general, computer system 100 comprises at least one CPU 101, a system memory 115, and at least one graphics processor unit (GPU) 110. The CPU 101 can be coupled to the system memory 115 via a bridge component/memory controller (not shown) or can be directly coupled to the system memory 115 via a memory controller (not shown) internal to the CPU 101. The GPU 110 is coupled to a display 112. One or more additional GPUs can optionally be coupled to system 100 to further increase its computational power. The GPU(s) 110 is coupled to the CPU 101 and the system memory 115. System 100 can be implemented as, for example, a desktop computer system or server computer system, having a powerful general-purpose CPU 101 coupled to a dedicated graphics rendering GPU 110. In such an embodiment, components can be included that add peripheral buses, specialized graphics memory, IO devices, and the like. Similarly, system 100 can be implemented as a handheld device (e.g., cellphone, etc.) or a set-top video game console device such as, for example, the Xbox®, available from Microsoft Corporation of Redmond, Wash., or the PlayStation3®, available from Sony Computer Entertainment Corporation of Tokyo, Japan.


It should be appreciated that the GPU 110 can be implemented as a discrete component, a discrete graphics card designed to couple to the computer system 100 via a connector (e.g., AGP slot, PCI-Express slot, etc.), a discrete integrated circuit die (e.g., mounted directly on a motherboard), or as an integrated GPU included within the integrated circuit die of a computer system chipset component (not shown), or within the integrated circuit die of a PSOC (programmable system-on-a-chip). Additionally, a local graphics memory 114 can be included for the GPU 110 for high bandwidth graphics data storage.


EMBODIMENTS OF THE INVENTION


FIG. 2 shows a diagram 200 illustrating internal components of the GPU 110 and the graphics memory 114 in accordance with one embodiment of the present invention. As depicted in FIG. 2, the GPU 110 includes a graphics pipeline 210 and a fragment data cache 250 which couples to the graphics memory 114 as shown.


The FIG. 2 embodiment illustrates exemplary components of the GPU 110 that implement the method and system for loading and executing indeterminate length shader programs. The GPU 110 includes the graphics pipeline 210 which includes a number of functional modules. Three such functional modules of the graphics pipeline 210, for example the program sequencer 220, the ALU 230, and the data write component 240, function by rendering graphics primitives that are received from a graphics application (e.g., from a graphics driver, etc.). The functional modules 220-240 access information for rendering the pixels related to the graphics primitives via the fragment data cache 250. The fragment data cache 250 functions as a high-speed cache for the information stored in the graphics memory 114 (e.g., the frame buffer memory).


The program sequencer 220 functions by controlling the operation of the functional modules of the graphics pipeline 210. The program sequencer 220 can interact with the graphics driver (e.g., a graphics driver executing on the CPU 101) to control the manner in which the functional modules of the graphics pipeline 210 receive information, configure themselves for operation, and process graphics primitives. For example, in the FIG. 2 embodiment, graphics rendering data (e.g., primitives, triangle strips, etc.), pipeline configuration information (e.g., mode settings, rendering profiles, etc.), and rendering programs (e.g., pixel shader programs, vertex shader programs, etc.) are received by the graphics pipeline 210 over a common input 260 from an upstream functional module (e.g., from an upstream raster module, from a setup module, or from the graphics driver). The input 260 functions as the main fragment data pathway, or pipeline, between the functional modules of the graphics pipeline 210. Primitives are generally received at the “top” of the pipeline and are progressively rendered into resulting rendered pixel data as they proceed from one module to the next along the pipeline.


In one embodiment, data proceeds between the functional modules 220-240 in a packet based format. For example, the graphics driver transmits data to the GPU 110 in the form of data packets. Data packets that send configuration and programming information are referred to as register packets. Data packets are specifically configured to interface with and be transmitted along the fragment pipe communications pathways of the pipeline 210. Data packets that send pixel data and/or rendering attributes are referred to as pixel packets. Data packets that send pixel data and/or rendering attributes that are common for a group of pixels are referred to as span header packets. The data packets generally include information regarding a group or tile of pixels (e.g., 4 pixels, 8 pixels, 16 pixels, etc.) and coverage information for one or more primitives that relate to the pixels. The data packets can also include configuration information that enables the functional modules of the pipeline 210 to configure themselves for rendering operations. For example, the data packets can include configuration bits, instructions, functional module addresses, etc. that can be used by one or more of the functional modules of the pipeline 210 to configure itself for the current rendering mode, or the like. In addition to pixel rendering information and functional module configuration information, the data packets can include shader program instructions that program the functional modules of the pipeline 210 to execute shader processing on the pixels. For example, the instructions comprising a shader program can be transmitted down the graphics pipeline 210 and be loaded by one or more designated functional modules. Once loaded, during rendering operations, the functional module can execute the shader program on the pixel data to achieve the desired rendering effect.


In this manner, the highly optimized and efficient fragment pipe communications pathway implemented by the functional modules of the graphics pipeline 210 can be used not only to transmit pixel data between the functional modules (e.g., modules 220-240), but to also transmit configuration information and shader program instructions between the functional modules.
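By way of illustration only, and not by way of limitation, the packet handling described above can be sketched in Python as follows. The names `PacketKind`, `Packet`, `FunctionalModule`, `module_address`, and `send_down_pipeline` are hypothetical and are not taken from the described embodiment; the sketch merely assumes that a register packet carries an address identifying the functional module that should latch its contents, while every packet continues along the same pathway.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class PacketKind(Enum):
    REGISTER = auto()     # configuration and programming information
    SPAN_HEADER = auto()  # attributes common to a group of pixels
    PIXEL = auto()        # per-pixel data and rendering attributes


@dataclass
class Packet:
    kind: PacketKind
    module_address: int = 0  # assumption: target module of a register packet
    payload: dict = field(default_factory=dict)


@dataclass
class FunctionalModule:
    address: int
    registers: dict = field(default_factory=dict)
    pixels_seen: int = 0

    def accept(self, packet: Packet) -> Packet:
        """Latch register packets addressed to this module, then forward."""
        if packet.kind is PacketKind.REGISTER:
            if packet.module_address == self.address:
                self.registers.update(packet.payload)
        elif packet.kind is PacketKind.PIXEL:
            self.pixels_seen += 1
        return packet  # every packet keeps moving down the same pathway


def send_down_pipeline(modules, packets):
    """Pass each packet through the modules in pipeline order."""
    for packet in packets:
        for module in modules:
            packet = module.accept(packet)
```

In this sketch a single pathway carries both programming information and pixel data, which mirrors the shared use of the fragment pipe described above.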


Referring still to FIG. 2, in the present embodiment, the program sequencer 220 functions by controlling the operation of the other components of the graphics pipeline 210 and working in conjunction with the graphics driver to implement a method for loading and executing an indeterminate length shader program. As used herein, the term “indefinite length” or “indeterminate length” shader program refers to the fact that the shader programs that can be executed by the GPU 110 are not arbitrarily limited by a predetermined, or format based, length. Thus for example, shader programs that can be executed can be short length shader programs (e.g., 16 to 32 instructions long, etc.), normal shader programs (e.g., 64 to 128 instructions long, etc.), long shader programs (e.g., 256 instructions long, etc.), very long shader programs (e.g., more than 1024 instructions long, etc.) or the like.


To execute shader programs of indeterminate length, the program sequencer 220 controls the graphics pipeline 210 to execute them in portions. The program sequencer 220 accesses a first portion of the shader program from the system memory 115 and loads the instructions from the first portion into the plurality of stages of the pipeline 210 (e.g., ALU 230, data write component 240, etc.) of the GPU 110 to configure the GPU 110 for program execution. As described above, the instructions for the first portion can be transmitted to the functional modules of the graphics pipeline 210 as pixel packets that propagate down the fragment pipeline. A group of pixels (e.g., a group of pixels covered by one or more primitives, etc.) is then processed in accordance with the instructions from the first portion. A second portion of the shader program is then accessed (e.g., DMA transferred in from the system memory 115) and instructions from the second portion are then loaded into the plurality of stages of the pipeline 210.


The group of pixels is then processed in accordance with the instructions from the second portion. In this manner, multiple shader program portions can be accessed, loaded, and executed to perform operations on the group of pixels. For example, for a given shader program that comprises a hundred or more portions, for each of the portions, the GPU can process the group of pixels by loading instructions for the portion and executing instructions for that portion, and so on until all the portions comprising the shader program are executed. This attribute enables embodiments of the present invention to implement the indefinite length shader programs. As described above, no arbitrary limit is placed on the length of a shader program that can be executed.
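As a minimal sketch of the portion-wise execution described above, and under stated assumptions only, the load-then-execute loop can be modeled in Python. Here each instruction is modeled as a function applied to a pixel value, and `capacity` is a hypothetical stand-in for the amount of instruction storage available in the pipeline stages; neither the function names nor this representation come from the described embodiment.

```python
def split_into_portions(program, capacity):
    """Cut a program into consecutive portions of at most `capacity` instructions."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return [program[i:i + capacity] for i in range(0, len(program), capacity)]


def execute_portion(instructions, pixels):
    """Run every loaded instruction, in order, on each pixel of the group."""
    results = []
    for value in pixels:
        for instruction in instructions:
            value = instruction(value)
        results.append(value)
    return results


def run_shader(program, pixels, capacity):
    """Execute a program of any length by loading one portion at a time."""
    for portion in split_into_portions(program, capacity):
        loaded = list(portion)  # load step: replaces the previous portion
        pixels = execute_portion(loaded, pixels)  # execute step
    return pixels
```

With a capacity of two instructions, a six-instruction program is executed as three portions and yields the same pixel values as it would if the whole program fit at once, which is the property that removes the fixed length limit in this model.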



FIG. 3 shows a diagram of the system memory 115 and a plurality of functional modules 321-324 of a graphics pipeline 315 in accordance with one embodiment of the present invention. As depicted in FIG. 3, the system memory 115 includes a plurality of instruction block images 301-304. These instruction block images are basically small shader programs that can be assembled, using two or more such small shader programs, into a large shader program. In this manner, the instruction block images 301-304 can be viewed as portions of a large shader program, as described above. A portion of a shader program is at times referred to as an epoch.


The functional modules 321-324 are typical functional modules of the 3-D graphics rendering pipeline 315 (e.g., setup unit, raster unit, texturing unit, etc.). The 3-D graphics rendering pipeline 315 comprises a core component of a GPU, and can be seen as a more basic diagram of the pipeline 210 from FIG. 2. The FIG. 3 embodiment shows the plurality of instruction block images 301-304 that can be DMA transferred into the pipeline 315 for use by one or more of the functional modules 321-324. The instruction block images 301-304 are configured to implement particular graphics rendering functions. In one embodiment, they are stored within the system memory 115 and are DMA transferred, via the DMA unit 331, when they are needed to perform graphics rendering operations on a group of pixels (e.g., being rendered on the GPU 110).


Each of the instruction block images 301-304 comprises a graphics rendering epoch that programs the hardware components of, for example, a functional unit 322 (e.g., an ALU unit, etc.) to perform a graphics rendering operation. A typical instruction block image (e.g., instruction block image 301) comprises a number of instructions. In the DMA transfer embodiment described above, the instruction block images are stored in system memory 115 and are maintained there until needed by a graphics application executing on the GPU 110.


The FIG. 3 embodiment indicates that a large number of instruction block images can be stored within the system memory 115 (e.g., 50 instruction block images, 200 instruction block images, or more). The large number of different instruction block images allows flexibility in fashioning specific graphics rendering routines that are particularly suited to the needs of a graphics application. For example, indeterminate length complex rendering routines can be fashioned by arranging multiple smaller rendering programs to execute in a coordinated fashion (e.g., arranging two or more shader programs to execute sequentially one after the other).


In one embodiment, the GPU 110 stores the group of pixels in graphics memory 114 subsequent to program execution for the first portion. This clears the fragment data pipe and the pipeline stages to be used for loading instructions for the second portion. Subsequent to loading instructions from the second portion, the GPU 110 accesses the group of pixels in the graphics memory 114 to perform program execution for the second portion. In this manner, the program sequencer 220 loads the first portion of the shader program, processes the group of pixels in accordance with the first portion, temporarily stores the intermediate result in the graphics memory 114, loads the second portion of the shader program, retrieves the intermediate results from the graphics memory 114 and processes the group of pixels (e.g., the intermediate results) in accordance with the second portion. This process can be repeated until all portions of an indeterminate length shader program have been executed and the group of pixels has been completely processed. The resulting processed pixel data is then transferred to the graphics memory 114 for rendering onto the display (e.g., display 112).
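The ordering of the store, load, and retrieve steps described above can be sketched, by way of example only, as follows. The dictionary `graphics_memory`, the key `"intermediate"`, and the event log are hypothetical devices used to make the ordering visible; the sketch assumes instructions are modeled as functions applied to pixel values and does not represent the actual hardware mechanism.

```python
def run_with_spill(portions, pixels):
    """Run each portion, parking the pixel group in memory between portions."""
    graphics_memory = {}      # stand-in for the graphics memory
    log = []                  # records the order of operations
    in_flight = list(pixels)  # pixel group currently in the pipeline

    for index, portion in enumerate(portions):
        loaded = list(portion)  # the pipe is empty, so instructions can load
        log.append(f"load portion {index}")

        if in_flight is None:
            # Retrieve the intermediate results stored after the prior portion.
            in_flight = graphics_memory.pop("intermediate")
            log.append("reload pixels")

        for instruction in loaded:
            in_flight = [instruction(value) for value in in_flight]
        log.append(f"execute portion {index}")

        # Store the results; this clears the pipe for the next instruction load.
        graphics_memory["intermediate"] = in_flight
        in_flight = None
        log.append("store pixels")

    return graphics_memory["intermediate"], log
```

For a two-portion program the log reads load, execute, store, load, reload, execute, store, so that pixel data and instructions are never in the pipe at the same time in this model.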



FIG. 4 shows a diagram of the internal components of the program sequencer 220 in accordance with one embodiment of the present invention. As depicted in FIG. 4, the program sequencer 220 includes an interface with the data write module 411 and a command unit 412 coupled to an upper logic unit 413. The upper logic unit 413 is coupled to the FIFO 414 and a lower logic unit 415 as shown.


As described above, the program sequencer 220 functions by managing the data flow through the pipeline 210 and managing the program execution of the functional modules of the pipeline 210. For example, the program sequencer 220 sets up the pipeline 210 for multiple pass or single pass operation, where a given group of pixels is passed through one or more functional modules of the pipeline 210 multiple times to implement more complex shader processing, or the group of pixels is passed through the functional modules of the pipeline 210 a single time to implement comparatively straightforward, less complex shader processing. The operation of the program sequencer 220 is controlled by a command unit 412 which executes a state machine that coordinates the operation of the program sequencer 220, and thus controls the operation of the GPU 110.


The upper logic unit 413 functions by receiving incoming pixel packets (e.g., from an upstream setup module or from a graphics driver). The upper logic unit examines the data/instructions of the received pixel packets and initiates any requests to memory that may be needed to process the data/instructions of the pixel packets. These memory requests are provided to the fragment data cache 250 which then fetches the requested data from the graphics memory 114 as shown in FIG. 4. The returned data comes from the graphics memory 114, through the fragment data cache 250, to the lower logic unit 415.


The lower logic unit 415 functions by merging the returning data from the graphics memory 114 into the data/instructions of the data packets that were received by the upper logic unit 413. The received pixel packets are passed from the upper logic unit 413 into a FIFO 414. The pixel packets are then sequentially passed through the many stages of the FIFO 414, emerge at the bottom of the FIFO 414, and are received by the lower logic unit 415. In this manner, the FIFO 414 functions by hiding the latency incurred between requests to the fragment data cache 250 and the graphics memory 114 and the return of the requested data from the graphics memory 114 to the lower logic unit 415. The lower logic unit 415 then merges the returned data with the originating data packet. In one embodiment, the FIFO 414 has a sufficient number of stages and is designed in such a manner as to allow the insertion of one pixel packet each clock at the top and the output of one pixel packet each clock at the bottom.
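The latency-hiding role of the FIFO described above can be illustrated with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the described circuit: the function name, the fixed `latency` in clocks, the dictionary standing in for memory, and the `"addr"` and `"data"` fields are all hypothetical. The model assumes one packet may enter at the top and one may leave at the bottom on each clock, and that the FIFO depth is at least the memory latency.

```python
from collections import deque


def simulate_sequencer(packets, memory, latency, depth):
    """Issue a fetch per packet at the top; merge the data at the bottom."""
    if depth < latency:
        raise ValueError("FIFO too shallow to cover the memory latency")

    fifo = deque([None] * depth)  # shift register with `depth` stages
    responses = deque()           # (ready_cycle, data), in request order
    pending = deque(packets)
    merged = []
    cycle = 0

    while pending or any(slot is not None for slot in fifo):
        # Bottom of the FIFO: the lower logic merges returned data.
        leaving = fifo.pop()
        if leaving is not None:
            ready_cycle, data = responses.popleft()
            if ready_cycle > cycle:
                raise RuntimeError("data not back yet; FIFO did not hide latency")
            merged.append({**leaving, "data": data})

        # Top of the FIFO: the upper logic issues the request and pushes.
        if pending:
            packet = pending.popleft()
            responses.append((cycle + latency, memory[packet["addr"]]))
            fifo.appendleft(packet)
        else:
            fifo.appendleft(None)  # bubble keeps the FIFO shifting
        cycle += 1

    return merged
```

Because a packet spends `depth` clocks in the FIFO in this model, its data has always returned by the time it reaches the bottom whenever `depth` is at least `latency`, so the pipeline never stalls waiting on memory.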


The FIG. 4 embodiment also shows a feedback path that proceeds from the lower logic unit 415 back up to the upper logic unit 413. This feedback from the lower logic unit 415 to the upper logic unit 413 accommodates those situations where the lower logic unit 415 generates register writes and sends those register writes back up to the upper logic unit 413. Register writes refer to the case where the instructions, configuration bits, or the like returned from the graphics memory 114 need to be written into one or more control registers of the command unit 412 to program the program sequencer 220. The register writes are then provided by the upper logic unit 413 to the command unit 412.


The FIG. 4 embodiment shows a set of outputs generated by the data write interface. In this implementation, the data write interface unit 411 functions by generating outputs that control the functioning of the other functional modules of the graphics pipeline 210. As shown in FIG. 4, the outputs comprise a read request (e.g., Rd req), a scoreboard clear request (e.g., SB clr), and a triangle done signal (e.g., Tri done). The data write interface unit 411 generates the outputs in accordance with instructions received from the command unit 412 (e.g., which executes the program sequencer state machine).


The read request output (e.g., Rd req) provides a path for outputs of a particular register to flow out of the program sequencer unit 220 and to circulate to other functional modules of the graphics pipeline 210 (e.g., can circulate out to the data write module 240, etc.) until the data returns to the program sequencer 220. The scoreboard clear output (e.g., SB clr) is used to coordinate access to the graphics memory 114 to manage read-modify-write data hazards that may occur when, for example, pixel data is updated by rendering processes performed by the graphics pipeline 210 but the memory location containing the pixel data is read by a functional module before the update has been fully realized (e.g., before the memory location is changed to reflect the updated pixel data). The triangle done output (e.g., Tri done) functions in a manner similar to the scoreboard clear output. The triangle done output is used to let other functional modules of the graphics pipeline 210 know when a graphics primitive (e.g., triangle) has completed processing through the pipeline 210. This lets the functional modules avoid read-modify-write hazards that can occur when a large number of triangles are in-flight at a time (e.g., 32 triangles in flight, or more).
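The description above does not detail how the scoreboard is organized. Purely as an illustration of how a scoreboard can guard against read-modify-write hazards in general, the following Python sketch assumes a set of pixel locations with updates still outstanding. The class and method names are hypothetical, and the actual signaling between the program sequencer and the other functional modules may differ.

```python
class Scoreboard:
    """Tracks pixel locations whose updates have not yet reached memory."""

    def __init__(self):
        self._busy = set()

    def try_acquire(self, location):
        """Mark a location as in flight; refuse if an update is outstanding."""
        if location in self._busy:
            return False  # caller must wait: reading now could see stale data
        self._busy.add(location)
        return True

    def clear(self, location):
        """Model of a scoreboard clear: the update is now visible in memory."""
        self._busy.discard(location)
```

In this model a second request for the same location is held off until `clear` is called for it, which is one way the stale read described above can be avoided.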



FIG. 5 shows a diagram of the internal components of the command unit 412 in accordance with one embodiment of the present invention. As illustrated in FIG. 5, the command unit 412 includes a plurality of control registers 501, a program counter 502, and the command table 503, coupled to execution logic 504. The control registers are for accepting and providing configuration information and state information from/to the other functional modules of the graphics pipeline 210. The command table 503 stores the commands that comprise the program instructions of the state machine that executes within the command unit 412. The program counter 502 keeps track of the current state and the next state for the state machine and indexes the instructions of the command table. The execution logic 504 includes the hardware sequential logic that implements the transitions of the state machine, which is shown in FIG. 6 below.
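
As a rough software analogy for the FIG. 5 components, the following sketch models the control registers, program counter, and command table as fields of one structure, with a single method standing in for the execution logic. The command names and the strictly sequential advance of the program counter are assumptions; the description does not define a command encoding.

```python
from dataclasses import dataclass, field


@dataclass
class CommandUnit:
    control_registers: dict = field(default_factory=dict)  # models registers 501
    program_counter: int = 0                                # models counter 502
    command_table: list = field(default_factory=list)       # models table 503

    def step(self):
        """Models execution logic 504: read the indexed command, then advance."""
        command = self.command_table[self.program_counter]
        self.program_counter += 1  # assumption: commands run in table order
        return command


# Hypothetical command names, used only to show the indexing behavior.
unit = CommandUnit(command_table=["FETCH", "EXECUTE_START", "EXECUTE_DONE"])
issued = [unit.step() for _ in range(3)]
```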



FIG. 6 shows a diagram of a state machine 600 executing on the program sequencer 220 in accordance with one embodiment of the present invention.


The initial state of the state machine 600 is the reset state 601. The reset state 601 is configured to establish a known initial state of the various registers, logic components, and the like of the command unit 412. The reset state 601 is the state in which the command unit 412 comes up upon an initial power up. The reception of a span start by the program sequencer 220 causes an exit from the reset state 601 to the pass through state 602. A span start comprises a header for an arriving group of pixel packets for a group of pixels that are received by the program sequencer 220. The span start also comprises the initial variables and common information that enable work on the group of pixels to begin.


After reset 601, the state machine 600 transitions to the pass through state 602, where one of two modes of execution is established for the graphics pipeline 210. These modes are either recirculation mode or pass-through mode. Recirculation mode is used when multiple passes through the graphics pipeline 210 are needed in order to implement comparatively complex shader processing. Pass-through mode is used when a single pass through the graphics pipeline 210 can be used to implement comparatively simple shader processing. If the control registers 501 indicate the pass-through mode, the transition 651 allows the state machine 600 to remain in pass-through mode until all the pixels of the pixel packets have been processed. If the pixel packets indicate command table execution or recirculation mode, the state machine moves into the fetch state 603.


The fetch state 603 is where the program sequencer 220 fetches program instructions for execution from the command table 503. These instructions are indexed by the program counter 502. As described above, in one embodiment, the command unit 412 commands the upper logic unit 413 to initiate an instruction fetch from the graphics memory 114, which returns the instructions to the lower logic unit 415. The lower logic unit 415 then passes the instructions back to the upper logic unit 413, into the command unit 412, and into the other functional modules of the graphics pipeline 210. The transition 653 allows the fetch state 603 to repeat until all of the instructions needed for that shader program or shader program portion have been fetched. When the fetch state 603 is complete, the state machine 600 transitions to a gather state 604, an execute start state 605, or an IMM (e.g., insert immediate) state 606.


The gather state 604 is where the program sequencer 220 gathers rendering commands from the graphics memory 114 and/or the system memory 115. These commands are used to load the command table 503 of the command unit 412. The commands also establish the location of the program counter 502. Additionally, the gather state 604 can load the control registers 501 and module instructions (e.g., not shown) in the functional units 321-324, etc. As described above, the state machine 600 is controlled by the command unit 412 executing the command table 503 as indexed by the program counter 502. Once the commands have been gathered, the state machine 600 transitions back to the fetch state 603 as shown.
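
The gather step can be sketched as a copy from a memory model into a command table. The dictionary-based memory, the addresses, the command names, and the choice to start the program counter at the first gathered command are all assumptions for illustration; the description states only that the gathered commands load the command table and establish the program counter location.

```python
def gather(memory, base, count):
    """Copy `count` commands starting at address `base` into a new command table."""
    command_table = [memory[base + offset] for offset in range(count)]
    program_counter = 0  # assumption: execution resumes at the first gathered command
    return command_table, program_counter


# Hypothetical memory contents keyed by address.
graphics_memory = {0x100: "LOAD_ALU", 0x101: "LOAD_TEX", 0x102: "EXECUTE"}
table, pc = gather(graphics_memory, base=0x100, count=3)
```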


The insert immediate state 606 is where a register packet is inserted into the pipeline, for updating functional instructions, control registers (like 501), etc. The data for this register packet is in the command instruction (e.g., from the command table 503).


The execute start state 605 is where command table execution in accordance with the program counter 502 begins. As described above, execution of the command table 503 causes the execution of the shader program. The execute start state 605 will transition into either the execute QZ state 607, the execute recirculate state 608, or the execute mop state 609.


The execute QZ state 607 is where the graphics pipeline 210 executes pixels received from the earlier stages of the GPU 110 (e.g., a raster stage, set up stage, or the like) or from the graphics driver, via the pipeline input 260. The pixels flow into the pipeline 210 and are progressively processed by each of the functional modules of the pipeline 210. State machine 600 can proceed from the execute QZ state 607 to the execute recirculate state 608 or the execute done state 610.


The execute recirculate state 608 is where the graphics pipeline 210 executes recirculating pixel data that has already passed through the graphics pipeline 210 at least one time. The state machine 600 can proceed from the execute recirculate state to the execute done state 610.


The execute mop state 609 is where the graphics pipeline 210 fetches pixel data from the graphics memory 114 for shader program execution. The state machine 600 can proceed from the execute mop state 609 to the execute recirculate state 608 or the execute done state 610.


The execute done state 610 is reached when shader program execution for the particular shader portion, or epoch, is complete. The state machine 600 can transition from the execute done state 610 back to the fetch state 603 or back to the pass-through state 602 as shown.
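
The transitions of the state machine 600 described above can be collected into a table and checked in software. The following is a sketch only: the state names are shorthand for states 601-610, and because the description does not state which state follows the insert immediate state 606, that entry is left empty rather than guessed.

```python
# Allowed next states for each state, as described for FIG. 6.
TRANSITIONS = {
    "RESET": {"PASS_THROUGH"},                       # on reception of a span start
    "PASS_THROUGH": {"PASS_THROUGH", "FETCH"},       # transition 651 repeats
    "FETCH": {"FETCH", "GATHER", "EXECUTE_START", "IMM"},  # transition 653 repeats
    "GATHER": {"FETCH"},
    "IMM": set(),                                    # exit not stated in the text
    "EXECUTE_START": {"EXECUTE_QZ", "EXECUTE_RECIRCULATE", "EXECUTE_MOP"},
    "EXECUTE_QZ": {"EXECUTE_RECIRCULATE", "EXECUTE_DONE"},
    "EXECUTE_RECIRCULATE": {"EXECUTE_DONE"},
    "EXECUTE_MOP": {"EXECUTE_RECIRCULATE", "EXECUTE_DONE"},
    "EXECUTE_DONE": {"FETCH", "PASS_THROUGH"},
}


def is_valid_path(path):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


# One epoch that gathers commands, executes, recirculates once, and finishes.
one_epoch = [
    "RESET", "PASS_THROUGH", "FETCH", "GATHER", "FETCH",
    "EXECUTE_START", "EXECUTE_QZ", "EXECUTE_RECIRCULATE",
    "EXECUTE_DONE", "PASS_THROUGH",
]
```

A path such as reset directly to fetch is rejected by this check, because the description routes every exit from the reset state 601 through the pass through state 602.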


In this manner, the state machine 600 can control the program sequencer unit 220, and in turn control graphics pipeline 210 and the GPU 110, to implement the loading and execution of indeterminate length shader programs. For very long shader programs having a large number of epochs, the process is repeated until all of the epochs have been completed. One or more of the epochs of the very long shader program could involve recirculation one or more times through the graphics pipeline 210 (e.g., via the execute recirculate state 608).
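
The repetition over epochs can be illustrated with a short sketch in which the same group of pixels is processed by each portion of the shader program in turn. Representing pixels as integers and instructions as Python callables is a simplification assumed here for illustration; it is not how the graphics pipeline 210 represents either.

```python
def run_shader(epochs, pixels):
    """Apply each epoch's instructions, in order, to the same group of pixels."""
    for instructions in epochs:            # one iteration per epoch
        loaded = list(instructions)        # stands in for loading the pipeline stages
        for operation in loaded:           # stands in for executing the epoch
            pixels = [operation(p) for p in pixels]
    return pixels


epochs = [
    [lambda p: p + 1, lambda p: p * 2],    # first portion of the shader program
    [lambda p: p - 3],                     # second portion of the shader program
]
result = run_shader(epochs, [0, 1, 2])     # yields [-1, 1, 3]
```

Because the loop places no bound on the number of epochs, the total program length in this sketch is limited only by how many portions are supplied, which mirrors the indeterminate length behavior described above.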


The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

Claims
  • 1. A method for loading and executing an indeterminate length shader program, comprising: accessing a first portion of a shader program in graphics memory of a GPU; loading instructions from the first portion into a plurality of stages of the GPU to configure the GPU for program execution; processing a group of pixels in accordance with the instructions from the first portion; accessing a second portion of the shader program in graphics memory of the GPU; loading instructions from the second portion into the plurality of stages of the GPU to configure the GPU for program execution; and processing the group of pixels in accordance with the instructions from the second portion, wherein the group of pixels are processed in accordance with the per first portion and subsequently processed in accordance with the 2nd portion to avoid shader program length limits and fragment surface count limits.
  • 2. The method of claim 1, further comprising: storing the group of pixels in graphics memory subsequent to program execution for the first portion; and subsequent to loading instructions from the second portion, accessing the group of pixels to perform program execution for the second portion.
  • 3. The method of claim 1, wherein the plurality of stages of the GPU are controlled by a program sequencer unit, and wherein the program sequencer unit configures the GPU to load instructions for the shader program and execute the shader program.
  • 4. The method of claim 3, wherein the program sequencer unit executes a state machine that coordinates the operation of the GPU.
  • 5. The method of claim 4, wherein the shader program comprises more than two portions, and for each of the portions the program sequencer unit causes the GPU to process the group of pixels in accordance with the portion.
  • 6. The method of claim 4, wherein the state machine controls the plurality of stages of the GPU to implement a recirculation of the group of pixels through the plurality of stages of the GPU for a complex graphics processing operation.
  • 7. The method of claim 4, wherein the state machine controls the plurality of stages of the GPU to implement a pass-through of the group of pixels through the plurality of stages of the GPU for a basic graphics processing operation.
  • 8. A GPU (graphics processing unit) for loading and executing a shader program, comprising: an integrated circuit die comprising a plurality of stages of the GPU; a memory interface for interfacing with a graphics memory; a host interface for interfacing with a computer system; and a program sequencer unit for controlling each of the plurality of stages, wherein the program sequencer causes the GPU to: access a first portion of a shader program in graphics memory of a GPU; load instructions from the first portion into a plurality of stages of the GPU to configure the GPU for program execution; process a group of pixels in accordance with the instructions from the first portion; access a second portion of the shader program in graphics memory of the GPU; load instructions from the second portion into the plurality of stages of the GPU to configure the GPU for program execution; and process the group of pixels in accordance with the instructions from the second portion, wherein the group of pixels are processed in accordance with the per first portion and subsequently processed in accordance with the 2nd portion to avoid shader program length limits and fragment surface count limits.
  • 9. The GPU of claim 8, further comprising: storing the group of pixels in graphics memory subsequent to program execution for the first portion; and subsequent to loading instructions from the second portion, accessing the group of pixels to perform program execution for the second portion.
  • 10. The GPU of claim 8, wherein the plurality of stages of the GPU are controlled by a program sequencer unit, and wherein the program sequencer unit configures the GPU to load instructions for the shader program and execute the shader program.
  • 11. The GPU of claim 10, wherein the program sequencer unit executes a state machine that coordinates the operation of the GPU.
  • 12. The GPU of claim 11, wherein the shader program comprises more than two portions, and for each of the portions the program sequencer unit causes the GPU to process the group of pixels in accordance with the portion.
  • 13. The GPU of claim 11, wherein the state machine controls the plurality of stages of the GPU to implement a recirculation of the group of pixels through the plurality of stages of the GPU for a complex graphics processing operation.
  • 14. The GPU of claim 11, wherein the state machine controls the plurality of stages of the GPU to implement a pass-through of the group of pixels through the plurality of stages of the GPU for a basic graphics processing operation.
  • 15. A handheld computer system device, comprising: a system memory; a CPU coupled to the system memory; and a GPU communicatively coupled to the CPU, wherein the GPU includes a program sequencer for controlling a plurality of stages of the GPU to execute a shader program, and wherein the program sequencer causes the GPU to: access a first portion of the shader program in graphics memory of the GPU; load instructions from the first portion into the plurality of stages of the GPU to configure the GPU for program execution; process a group of pixels in accordance with the instructions from the first portion; access a second portion of the shader program in graphics memory of the GPU; load instructions from the second portion into the plurality of stages of the GPU to configure the GPU for program execution; process the group of pixels in accordance with the instructions from the second portion; store the group of pixels in graphics memory subsequent to program execution for the first portion; and access the group of pixels to perform program execution for the second portion subsequent to loading instructions from the second portion, wherein the group of pixels are processed in accordance with the per first portion and subsequently processed in accordance with the 2nd portion to avoid shader program length limits and fragment surface count limits.
  • 16. The handheld computer system device of claim 15, wherein the program sequencer unit configures the GPU to load instructions for the shader program and execute the shader program.
  • 17. The handheld computer system device of claim 15, wherein the program sequencer unit executes a state machine that coordinates the operation of the GPU.
  • 18. The handheld computer system device of claim 15, wherein the shader program comprises more than two portions, and for each of the portions the program sequencer unit causes the GPU to process the group of pixels in accordance with the portion.
  • 19. The handheld computer system device of claim 15, wherein the state machine controls the plurality of stages of the GPU to implement a recirculation of the group of pixels through the plurality of stages of the GPU for a complex graphics processing operation.
  • 20. The handheld computer system device of claim 15, wherein the state machine controls the plurality of stages of the GPU to implement a pass-through of the group of pixels through the plurality of stages of the GPU for a basic graphics processing operation.