1. Field of the Invention
The invention described herein relates to execution of programs in a processor, and more particularly relates to the sequence of execution of program threads.
2. Background Art
It is common for streaming processors to execute a program by executing individual threads of the program. Conventionally, threads are expected to complete in the order in which they are created. Prioritizing the execution of instructions of older threads ahead of those of newer threads helps ensure that threads complete in the order they were started, particularly if the threads execute the same instructions. If threads complete in the order they are created, then fewer threads are in existence at any given time, consuming fewer overall resources. The threads do not necessarily need to complete in order, and often will not, but there are advantages to having them do so. Further, it is common practice to load one or more instructions into an instruction cache from memory prior to their execution. This avoids the time-intensive process of fetching an instruction from memory at the moment it is needed.
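The oldest-thread-first policy described above can be sketched as a ready queue ordered by creation sequence. This is a minimal illustrative sketch only; the class and method names are assumptions for exposition, not the patent's actual mechanism.

```python
import heapq

class AgePriorityScheduler:
    """Illustrative sketch: among ready threads, the oldest thread
    (lowest creation sequence number) is always selected first, which
    encourages threads to retire roughly in creation order."""

    def __init__(self):
        self._next_seq = 0
        self._ready = []  # min-heap of (creation_seq, thread_id)

    def create_thread(self, thread_id):
        # Tag each thread with a monotonically increasing sequence number.
        heapq.heappush(self._ready, (self._next_seq, thread_id))
        self._next_seq += 1

    def pick_next(self):
        # Pop the thread with the smallest sequence number (the oldest).
        if not self._ready:
            return None
        _, thread_id = heapq.heappop(self._ready)
        return thread_id

sched = AgePriorityScheduler()
for tid in ["t0", "t1", "t2"]:
    sched.create_thread(tid)
print(sched.pick_next())  # "t0", the oldest thread, is selected first
```

Because the heap is keyed on creation sequence alone, newer threads can never overtake older ones in the ready queue, which is the property the background describes.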
Still, in spite of the caching process, some latency remains, given the requirement of loading one or more instructions into the instruction cache. The latency from the caching process can be so significant that some multi-threaded processors may switch to a different thread while an instruction of the original thread is being cached. As a result, when processor resources become free, the instructions that need to use those resources may not yet have been loaded into the instruction cache. The processor resources will go unused by these instructions until the high-latency fetch has completed. A similar situation exists when a local data cache is used to store constant values referenced by instructions. Here, if an instruction needs a new set of constants (different from those currently cached), instruction execution may stall until the new set of constants has been loaded. This problem is particularly important in computer graphics processing. Any data that is accessed through a cache and needed by a shader program, for example, can potentially create this problem.
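The thread-switching behavior described above can be shown with a toy simulation: when the running thread's next instruction misses in the instruction cache, the processor issues work from another ready thread instead of idling through the fetch. All names and the one-pass fetch latency here are simplifying assumptions for illustration, not the actual hardware behavior.

```python
def run(threads, icache):
    """Toy latency-hiding sketch (illustrative only).

    threads: list of (name, [instruction addresses]) pairs.
    icache:  set of instruction addresses already cached.
    Returns the order in which (thread, address) pairs were issued.
    """
    issued = []
    pending = {name: list(addrs) for name, addrs in threads}
    while any(pending.values()):
        for name, addrs in pending.items():
            if not addrs:
                continue
            addr = addrs[0]
            if addr in icache:
                # Hit: issue the instruction for this thread.
                issued.append((name, addrs.pop(0)))
            else:
                # Miss: start the fetch, then move on to another thread
                # rather than stalling; the fetch completes by the next pass.
                icache.add(addr)
    return issued

order = run([("A", [1, 2]), ("B", [3])], {2, 3})
print(order)  # thread B makes progress while A's first instruction is fetched
```

In this sketch, thread A's first instruction misses, so thread B's cached instruction is issued during the fetch, which is the latency-hiding behavior the background attributes to multi-threaded processors.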
One commonly implemented method to avoid leaving processor resources unused during an instruction or data fetch is to pre-fetch instructions or data into a cache prior to execution. Such a mechanism generally requires significant additional hardware complexity, however.
There is a need, therefore, for a method and system that streamlines the execution of multiple threads. A desired solution would avoid the pitfalls of a pre-fetch scheme, while otherwise addressing the above-described latency problems in the caching of instructions and data.
In an embodiment of the invention, the execution of the first thread of a new program is prioritized ahead of older threads of a previously running program. This is illustrated in the accompanying drawings.
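The priority inversion just described can be sketched as a selection policy: the first thread of a newly started program jumps ahead of all older threads of the previous program, so its instruction-cache misses are serviced early while the remaining threads of the old program keep the execution units busy. The function and field names below are assumptions made for exposition, not the patent's actual interface.

```python
def pick_next(ready_threads):
    """Hypothetical scheduling sketch. Each thread is a dict with a
    'name', an 'age' (lower = older), and a 'first_of_new_program'
    flag. The first thread of a new program is prioritized ahead of
    older threads of the previously running program."""
    new_first = [t for t in ready_threads if t["first_of_new_program"]]
    if new_first:
        # A new program's lead thread runs first, warming the caches
        # ahead of that program's remaining threads.
        return min(new_first, key=lambda t: t["age"])
    # Otherwise fall back to the conventional oldest-first ordering.
    return min(ready_threads, key=lambda t: t["age"])

ready = [
    {"name": "old_prog_t0", "age": 0, "first_of_new_program": False},
    {"name": "old_prog_t1", "age": 1, "first_of_new_program": False},
    {"name": "new_prog_t0", "age": 2, "first_of_new_program": True},
]
print(pick_next(ready)["name"])  # "new_prog_t0" jumps ahead of older threads
```

This inverts the conventional oldest-first rule for exactly one thread, overlapping that thread's high-latency instruction fetches with the tail of the previous program's execution.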
The process of this embodiment is illustrated in greater detail in the accompanying drawings.
Another embodiment of the invention is illustrated in the accompanying drawings.
The process of this embodiment is likewise illustrated in the accompanying drawings.
The invention can work with software, hardware, and/or operating system implementations other than those described herein. Any software, hardware, and operating system implementations suitable for performing the functions described herein can be used.
In an embodiment of the present invention, the system and components described herein are implemented using well known computer systems, such as a computer system 500 shown in the accompanying drawings.
The computer system 500 includes one or more processors (also called central processing units, or CPUs), such as a processor 504. This processor may be a graphics processor in an embodiment of the invention. The processor 504 is connected to a communication infrastructure or bus 506. The computer system 500 also includes a main or primary memory 508, such as random access memory (RAM). The primary memory 508 has stored therein control logic (computer software), and data.
The computer system 500 also includes one or more secondary memory storage devices 510. The secondary storage devices 510 include, for example, a hard disk drive 512 and/or a removable storage device or drive 514. The removable storage drive 514 represents a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage drive, etc.
The removable storage drive 514 interacts with a removable storage unit 518. The removable storage unit 518 includes a computer useable or readable storage medium having stored therein computer software (control logic) and/or data. The logic of the invention, as illustrated in the accompanying drawings, may be stored in such a removable storage unit 518.
The computer system 500 may also include input/output/display devices 530, such as monitors, keyboards, pointing devices, etc.
The computer system 500 further includes a communication or network interface 527. The network interface 527 enables the computer system 500 to communicate with remote devices. For example, the network interface 527 allows the computer system 500 to communicate over communication networks or media 526 (representing a form of a computer useable or readable medium), such as LANs, WANs, the Internet, etc. The network interface 527 may interface with remote sites or networks via wired or wireless connections.
Control logic may be transmitted to and from the computer system 500 via the communication medium 526. More particularly, the computer system 500 may receive and transmit carrier waves (electromagnetic signals) modulated with control logic via the communication medium 526.
Any apparatus or manufacture comprising a computer useable or readable medium having control logic (software) stored therein is referred to herein as a computer program product or program storage device. This includes, but is not limited to, the computer system 500, the main memory 508, the hard disk 512, and the removable storage unit 518. Carrier waves can also be modulated with control logic. Such computer program products, having control logic stored therein that, when executed by one or more data processing devices, cause such data processing devices to operate as described herein, represent embodiments of the invention.
It is to be appreciated that the Detailed Description section, and not the Abstract section, is intended to be used to interpret the claims. The Abstract section may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventors, and thus is not intended to limit the present invention and the appended claims in any way.
The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.