1. Technical Field
The present disclosure relates generally to information processing systems and, more specifically, to improved efficiency for self-scheduling of user-level threads that are not scheduled by an operating system.
2. Background Art
In order to increase performance of information processing systems, such as those that include microprocessors, both hardware and software techniques have been employed. On the hardware side, microprocessor design approaches to improve microprocessor performance have included increased clock speeds, pipelining, branch prediction, super-scalar execution, out-of-order execution, and caches. Many such approaches have led to increased transistor count, and have even, in some instances, resulted in transistor count increasing at a rate greater than the rate of improved performance.
Rather than seek to increase performance strictly through additional transistors, other performance enhancements involve software techniques. One software approach that has been employed to improve processor performance is known as “multithreading.” In software multithreading, an instruction stream may be divided into multiple instruction streams that can be executed in parallel. Alternatively, multiple independent software streams may be executed in parallel.
In one approach, known as time-slice multithreading or time-multiplex (“TMUX”) multithreading, a single processor switches between threads after a fixed period of time. In still another approach, a single processor switches between threads upon occurrence of a trigger event, such as a long latency cache miss. In this latter approach, known as switch-on-event multithreading (“SoEMT”), only one thread, at most, is active at a given time.
Increasingly, multithreading is supported in hardware. For instance, in one approach, processors in a multi-processor system, such as a chip multiprocessor (“CMP”) system, may each act on one of the multiple software threads concurrently. In another approach, referred to as simultaneous multithreading (“SMT”), a single physical processor is made to appear as multiple logical processors to operating systems and user programs. For SMT, multiple software threads can be active and execute simultaneously on the single physical processor without switching. That is, each logical processor maintains a complete set of the architecture state, but many other resources of the physical processor, such as caches, execution units, branch predictors, control logic and buses, are shared. For SMT, the instructions from multiple software threads, each on a distinct logical processor, execute concurrently.
For a system that supports concurrent execution of software threads, such as SMT and/or CMP systems, an operating system application may control scheduling and execution of the software threads. Typically, however, operating system control does not scale well; the ability of an operating system application to schedule threads without negatively impacting performance is commonly limited to a relatively small number of threads.
Embodiments of the present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are not intended to be limiting but are instead provided to illustrate selected embodiments of an apparatus, system and method to judiciously schedule user-level threads in a multithreaded system.
The following discussion describes selected embodiments of methods, systems and articles of manufacture to improve efficiency of scheduling for multiple concurrently-executed user-level threads of execution (referred to as “shreds”) that are not created or scheduled by the operating system. The shreds are instead scheduled by a feedback-driven scheduler that can dynamically adapt shred scheduling based on runtime feedback and prediction of inter-shred correlations.
The shreds may be scheduled to run on one or more OS-sequestered sequencers. The OS-sequestered sequencers are sometimes referred to herein as “OS-invisible”; the operating system does not schedule work on such sequencers. The mechanisms described herein may be utilized with single-core or multi-core multithreading systems. In the following description, numerous specific details such as processor types, multithreading environments, system configurations, and numbers and topology of sequencers in a multi-sequencer system have been set forth to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the present invention.
A shared-memory multiprocessing paradigm may be used in an approach referred to as parallel programming. According to this approach, an application programmer may split a software program, sometimes referred to as an “application” or “process,” into multiple tasks to be run concurrently in order to express parallelism for a software program. All threads of the same software program (“process”) share a common logical view of memory.
The operating system (“OS”) 140 is commonly responsible for managing the user-defined tasks for a process (e.g., processes 103 and 120). While each process has at least one task (see, e.g., Process 0 100 and Process 2 103), some may have more than one such task (e.g., Process 1 120). The number of processes illustrated in
The OS 140 is commonly responsible for scheduling these threads 125, 126, 127 for execution on the execution resources. The threads associated with the same process typically have the same virtual memory address space.
Because the OS 140 is responsible for creating, mapping, and scheduling threads, the threads 125, 126, 127 are “visible” to the OS 140. In addition, embodiments of the present invention comprehend additional threads 130-139 that are not visible to the OS 140. That is, the OS 140 does not create, manage, or otherwise acknowledge or control these additional threads 130-139. These additional threads, which are neither created nor controlled by the OS 140, are sometimes referred to herein as “shreds” 130-139 in order to distinguish them from OS-visible threads. The shreds are created and managed by user-level programs (referred to as “shredded programs”) and may be scheduled to run on sequencers that are sequestered from the operating system. The OS-sequestered sequencers typically share a common set of ring 0 states as OS-visible sequencers. These shared ring-0 architectural states are typically those responsible for supporting a common shared memory address space between the OS-visible sequencer and OS-sequestered sequencers. For example, for an embodiment based on IA-32 architecture, CR0, CR2, CR3, CR4 are some of these shared ring-0 architectural states. Shreds thus share the same execution environment (virtual address map) that is created for the threads associated with the same process.
As used herein, the terms “thread” and “shred” include, at least, the concept of a set of instructions to be executed concurrently with other threads and/or shreds of a process. The thread and “shred” terms both encompass the idea, therefore, of a set of software primitives or application programming interfaces (API). As used herein, a distinguishing factor between a thread (which is OS-controlled) and a shred (which is not visible to the operating system and is instead user-controlled), which are both instruction streams, lies in the difference of how scheduling and execution of the respective thread and shred instruction streams are managed. A thread is generated in response to a system call to the OS. The OS generates that thread and allocates resources to run the thread. Such resources allocated for a thread may include data structures that the operating system uses to control and schedule the threads.
In contrast, at least one embodiment of a shred is generated via a user level software “primitive” that invokes an OS-independent mechanism for generating a shred that the OS is not aware of. A shred may thus be generated in response to a user-level software call. For at least one embodiment, the user-level software primitives may involve user-level (ring-3) instructions that can create a user-level shred in hardware or firmware. The user-level shred thus created may be scheduled by hardware and/or firmware and/or user-level software. The OS-independent mechanism may be software code that sits in user space, such as a software library. The techniques for shred scheduling optimizations discussed herein may be used with any user-level thread package.
However, other processes 103, 120 may be associated with one or more OS-scheduled threads as illustrated in
illustrates that a particular logical view 200 of memory is shared by all threads 125, 126 associated with a particular process 120.
Accordingly,
Thus, instead of relying on the operating system to manage the mapping between thread unit hardware and shreds, scheduler logic in user space may manage the mapping. For at least one embodiment, the scheduler logic may be in a runtime software library.
For at least one embodiment a user may directly control such mapping by utilizing shred control instructions or primitives that are handled by the scheduler or other logic in software, such as in a runtime library. In addition, the user may directly manipulate control and state transfers associated with shred execution. Accordingly, for embodiments of the methods, mechanisms, articles of manufacture, and systems described herein, a user-visible feature of the architecture of the thread units is at least a canonical set of instructions that allow a user direct manipulation and control of thread unit hardware.
As used herein, a thread unit, also interchangeably referred to herein as a “sequencer”, may be any physical or logical unit capable of executing a thread or shred. It may include next instruction pointer logic to determine the next instruction to be executed for the given thread or shred. For example, the OS thread 125 illustrated in
In the single-core multithreading environment 310, a single physical processor 304 is made to appear as multiple logical processors (not shown), referred to herein as LP1 through LPn, to operating systems and user programs. Each logical processor LP1 through LPn maintains a complete set of the architecture state AS1-ASn, respectively. The architecture state includes, for at least one embodiment, data registers, segment registers, control registers, debug registers, and most of the model specific registers. The logical processors LP1-LPn share most other resources of the physical processor 304, such as caches, execution units, branch predictors, control logic and buses. Although such features may be shared, each thread context in the multithreading environment 310 can independently generate the next instruction address (and perform, for instance, a fetch from an instruction cache, an execution instruction cache, or trace cache). Thus, the processor 304 includes logically independent next-instruction-pointer and fetch logic 320 to fetch instructions for each thread context, even though the multiple logical sequencers may be implemented in a single physical fetch/decode unit 322. For a single-core multithreading embodiment, the term “sequencer” encompasses at least the next-instruction-pointer and fetch logic 320 for a thread context, along with at least some of the associated architecture state, 312, for that thread context. It should be noted that the sequencers of a single-core multithreading system 310 need not be symmetric. For example, two single-core multithreading sequencers for the same physical core may differ in the amount of architectural state information that they each maintain.
A single-core multithreading system can implement any of various multithreading schemes, including simultaneous multithreading (SMT), switch-on-event multithreading (SoEMT) and/or time multiplexing multithreading (TMUX). When instructions from more than one hardware thread context (or logical processor) run in the processor concurrently at any particular point in time, the scheme is referred to as SMT. Otherwise, a single-core multithreading system may implement SoEMT, where the processor pipeline is multiplexed between multiple hardware thread contexts, but at any given time, only instructions from one hardware thread context may execute in the pipeline. For SoEMT, if the thread switch event is time based, then the scheme is TMUX.
Thus, for at least one embodiment, the multi-sequencer system 310 is a single-core processor 304 that supports concurrent multithreading. For such embodiment, each sequencer is a logical processor having its own next-instruction-pointer and fetch logic and its own architectural state information, although the same physical processor core 304 executes all thread instructions. For such embodiment, the logical processor maintains its own version of the architecture state, although execution resources of the single processor core may be shared among concurrently-executing threads.
For at least one embodiment of the multi-core system 350 illustrated in
For ease of discussion, the following discussion focuses on embodiments of the multi-core system 350. However, this focus should not be taken to be limiting, in that the mechanisms described below may be performed in either a multi-core or single-core multi-sequencer environment.
As is stated above, the scheduling mechanism 400 may be employed rather than an OS-provided scheduling mechanism. Each work descriptor describes a shred that is to be executed, independent of OS intervention, on either an OS-sequestered or OS-visible sequencer.
Shred descriptors may be created in response to user-level shred creation instructions (or “primitives”) executed by another shred or by a shred-aware thread. The descriptors may be placed into the work queue system 402. For at least one embodiment, the user-level instructions that trigger creation of shred descriptors are API-like (“Application Programmer Interface”) thread control primitives such as “shred_create” or “shred_fork”.
As used herein, an instruction or primitive described as being generated by a programmer or user is intended to encompass not only architectural instructions that may be generated by an assembler or compiler based on user-generated code, or by a programmer working in an assembly language, but also any high-level primitive or instruction that may ultimately be assembled or compiled into architectural shred control instructions. It should also be understood that an architectural shred control instruction may be further decoded into one or more micro-operations.
One of skill in the art will recognize that there may be one or more levels of abstraction between the programmer's code (e.g., code that includes an API-like shred_create primitive) and actual architectural instructions that cause a sequencer to perform actions resulting in the generation of shred descriptors and placement of the descriptors into a work queue 402. Software 440, such as that provided by a software runtime library, may create, responsive to a shred_create primitive, a shred descriptor for the new shred and may place it into the work queue system 402.
For at least one embodiment, then, a shred descriptor is thus created by software 440 responsive to a shred_create primitive and is placed into the queue system 402. The shred descriptor may be, for at least one embodiment, a record that identifies at least the following properties for a shred: a) the address at which the shred should begin execution and b) a stack descriptor. The stack descriptor identifies the memory storage area (stack) to be used by the new shred to store temporary variables, such as local variables and return addresses.
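The descriptor record and its placement into the queue system can be sketched as follows. This is a minimal illustration in Python, not the disclosed implementation: the class names, the field names `base` and `size`, and the FIFO ordering of the queue are assumptions made for the example; only the two descriptor properties (start address and stack descriptor) come from the description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class StackDescriptor:
    base: int   # hypothetical field: lowest address of the stack region
    size: int   # hypothetical field: size of the region in bytes


@dataclass(frozen=True)
class ShredDescriptor:
    start_address: int       # (a) address at which the shred begins execution
    stack: StackDescriptor   # (b) stack used for locals and return addresses


class WorkQueue:
    """Hypothetical FIFO stand-in for the work queue system 402."""

    def __init__(self):
        self._pending = deque()

    def shred_create(self, start_address, stack_base, stack_size):
        # Build the descriptor in user space and enqueue it; no OS call is made.
        desc = ShredDescriptor(start_address,
                               StackDescriptor(stack_base, stack_size))
        self._pending.append(desc)
        return desc

    def next_shred(self):
        # Hand the oldest pending descriptor to a sequencer, or None if empty.
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)
```

A scheduler loop on each sequencer would repeatedly call `next_shred()` and begin execution at the returned `start_address`.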
It should be noted that the sequencers 403, 404 illustrated in
Regarding symmetry,
The sequencers 403, 404 may differ in any manner, including those aspects that affect quality of computation. For example, the sequencers may differ in terms of power consumption, thermal metrics, speed of computational performance, functional features, microarchitectural organization, architectural features, or the like. By way of example, for one embodiment, the sequencers 403, 404 may differ in terms of functionality. For example, one sequencer may be capable of executing integer and floating point instructions, but cannot execute a single instruction multiple data (“SIMD”) set of instruction extensions, such as Streaming SIMD Extensions 3 (“SSE3”). On the other hand, another sequencer may be capable of performing all the instructions that the first sequencer can execute, and can also execute SSE3 instructions.
As another example of functional asymmetry, one sequencer 403 may be visible to the OS (see, for example, 140 of
The sequencers of a system on which the scheduling mechanism 400 is utilized may also differ in any other manner, such as footprint, word width and/or data path size, topology, memory, power consumption, number of functional units, communication architectures (multi-drop vs. point-to-point interconnect), or any other metric related to functionality, performance, footprint, or the like.
For at least one embodiment, the functionality of type A and type B sequencers may be mutually exclusive. That is, for example, one type of sequencer 403 may support a particular functionality, such as execution of SSE3 instructions, that the other type of sequencer 404 does not support; while the second type of sequencer 404 may support a particular functionality, such as ring 0 operations, that the first type of sequencer 403 does not support.
However, for at least one other embodiment, the functionality of sequencer types A 403 and B 404 represent a superset-subset functionality relationship rather than a mutually exclusive functionality relationship. That is, a first set of sequencers (such as type A sequencers 403) provide a superset of functionality that includes all functionality of a second set of sequencers (such as type B sequencers 404), plus additional functionality that is not provided by the second set of sequencers 404.
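One way to picture the superset-subset relationship is to model each sequencer type as a set of supported features and to match a shred's requirements against those sets. The sketch below is illustrative only; the feature names and the `eligible_sequencers` helper are hypothetical and do not appear in the disclosure.

```python
# Hypothetical feature sets: type A supports everything type B does, plus SSE3.
SEQUENCERS = {
    "type_a": frozenset({"int", "fp", "sse3"}),
    "type_b": frozenset({"int", "fp"}),
}


def eligible_sequencers(required, sequencers):
    """Return the names of sequencers whose features cover `required`."""
    required = frozenset(required)
    return sorted(name for name, feats in sequencers.items()
                  if required <= feats)


def is_proper_superset(a, b):
    """True when feature set `a` has all of `b` plus something more."""
    return a > b
```

Under this model a shred that needs SSE3 can run only on a type A sequencer, while an integer-only shred may be placed on either type.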
For at least some embodiments of the mechanisms, systems, and methods described herein, a distributed scheduler 450 operates as an event-driven self-scheduler where shreds are created in response to queued scheduling events that are created as a result of API-like shred control (e.g., shred_create, shred_fork and/or the like) or shred synchronization (e.g., shred_yield, mutex (shred_lock/shred_unlock), critical section, and/or the like) instructions or primitives.
In addition, the software library 500 may also include shred synchronization control software 504. The shred synchronization control software 504 may perform shred synchronization functions in response to a shred synchronization user-level primitive, such as a yield primitive or a shred mutex or critical section primitive.
If a “yield” primitive is encountered in the current shred, a shred descriptor for the calling process may be placed back into the queue system and control returned to the scheduler 450. Accordingly, upon execution of a “yield” primitive, the synchronization control software 504 may place a shred descriptor for the remaining shred instructions for the current shred back into the work queue system 402 (
In addition, the software library 500 may also include a scheduling hints generator 506. The scheduling hints generator 506 may create a shred dependency graph (SDG) and/or time-stamped shred dependency graph (TSDG), discussed in further detail below.
The run-time library 500 may create an intermediate layer of abstraction between a traditional industry standard API, such as a Portable Operating System Interface (“POSIX”) compliant API, and the hardware of a multi-sequencer system that supports at least a canonical set of shred instructions. The run-time library 500 may act as an intermediate level of abstraction so that a programmer may utilize a traditional thread API (such as, for instance, PTHREADS API or WINDOWS THREADS API or OPENMP API) with hardware that supports shredding. The library 500 may provide functions that transparently invoke the canonical shred instructions, based on user-programmed primitives.
In addition to monitoring program behavior, the scheduling hints generator 506 may analyze, characterize and record certain aspects of the execution history. For at least one embodiment, these aspects of the execution history may be recorded in the form of a shred dependency graph 600, a time-stamped shred dependency graph 604, or both.
The shred dependency graph (“SDG”) 600 explicitly represents shredded program execution as a graph of shred dependencies. For at least one embodiment, the SDG 600 may be a directed graph, where each node is a shred and each line is a dependency between two shreds. The SDG 600 thus represents the dependencies among the shred instances that are dynamically executed during an execution pass of the shredded program 602.
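A directed graph of this kind can be held in a simple adjacency structure. The following Python sketch is an assumption-laden illustration: the class name, the method names, and the edge labels (here, the name of the primitive that caused the dependency) are hypothetical choices for the example.

```python
from collections import defaultdict


class ShredDependencyGraph:
    """Directed graph: nodes are shred instances, edges are dependencies."""

    def __init__(self):
        self.nodes = set()
        self._edges = defaultdict(list)   # source -> [(target, label), ...]

    def add_dependency(self, src, dst, label):
        # Record that shred `dst` depends on shred `src`; `label` is a
        # hypothetical tag such as the primitive that caused the dependency.
        self.nodes.update((src, dst))
        self._edges[src].append((dst, label))

    def dependents(self, shred):
        # Shreds that directly depend on `shred` (empty list if none).
        return [dst for dst, _ in self._edges.get(shred, [])]

    def edges(self):
        # Flat list of (source, target, label) triples.
        return [(src, dst, label)
                for src, targets in self._edges.items()
                for dst, label in targets]
```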
The label on each of these four edges shown in
Returning to
A dependence may be recorded when the scheduler encounters a shred control primitive or instruction such as “shred_create”. In addition, a dependence may be recorded when the scheduler encounters a synchronization primitive or instruction such as a mutex, yield, or critical section primitive. That is, a dependence may be defined as an occurrence of one shred being blocked from further execution while waiting for some event to occur on another shred. For example,
It is straightforward to identify which shreds are on the system critical path with the information provided by the TSDG 800. The system critical path 820 may be easily identified by starting at the node of the TSDG 800 that has the largest time value (representing the latest node) and traversing upwards to the root of the TSDG 800.
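The traversal just described can be sketched in a few lines. The sketch assumes the TSDG is available as two mappings, node to timestamp and node to list of predecessors; these mappings and the function name are hypothetical. Where a node has several predecessors, the sketch follows the one with the latest timestamp, on the assumption that the last-arriving dependency is the one that gated the node; the description above does not specify this tie-breaking rule.

```python
def critical_path(times, parents):
    """Walk from the latest node up to the root of an acyclic TSDG.

    times:   dict mapping node -> time value recorded for that node
    parents: dict mapping node -> list of predecessor nodes (root has none)
    """
    node = max(times, key=times.get)          # node with the largest time value
    path = [node]
    while parents.get(node):
        # Assumption: follow the predecessor that finished last.
        node = max(parents[node], key=times.get)
        path.append(node)
    path.reverse()                            # report root first
    return path
```

For a graph in which C (time 9) depends on A (time 5) and B (time 3), both created by the root, the function returns the chain root, A, C.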
Returning to
In addition, if the hints generator 506 utilizes information from the shred synchronization control software 504, such as information related to synchronization objects such as mutexes, condition variables, etc., then the SDG 600 and/or TSDG 604 generated based on such information may also reflect shred data dependencies in addition to shred control dependencies.
The scheduling hints generator 506 may employ any one or more of several optimization approaches that take advantage of the scheduling information 608 about dynamic behavior of inter-shred interactions of the shredded program 602. Any optimization approach that attempts to explore thread-level parallelism may be employed. For example, thread-level analogs may be implemented for many classic instruction-level parallelism (ILP) algorithms that are based on instruction data or control dependency graphs. These algorithms include list scheduling, stochastic scheduling, and tree traversal scheduling. Analogous approaches for thread-level parallelism, based on the SDG and the TSDG, may be employed. For at least one embodiment, the optimization approaches employed by the scheduling hints generator 506 may include one or more of: system critical path scheduling, data flow shred scheduling, and dynamic power throttling.
System Critical Path Scheduling. This optimization approach recognizes that certain nodes of the TSDG 604 are more critical to performance of the application program 602 than are other nodes. When performing the system critical path scheduling optimization, the hints generator 506 identifies the critical path, that is, those nodes whose performance affects overall performance for the program 602. The system critical path through the TSDG 604 has the property that no other path in the program 602 has a longer latency. If these nodes take longer to execute, then overall performance of the program 602 is slowed. The hints generator 506 identifies all shreds on the critical path as “critical shreds” and provides a hint to indicate that the scheduler 450 should schedule such shreds with a higher priority than other, non-critical, shreds.
By using this system critical path information, a shred scheduler 450 may improve performance by prioritizing critical shreds. For a scheduler on a symmetric multi-sequencer system, the optimization may involve simply scheduling critical shreds with a higher priority. For an asymmetric multi-sequencer system, the optimization may, for example, involve scheduling critical shreds on faster and/or more powerful sequencers. In general, the scheduler may utilize system critical path information to reduce latency of the system critical path in order to reduce overall program latency.
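For the symmetric case, prioritizing critical shreds can be as simple as a two-level ready queue. The sketch below uses a binary heap keyed on a priority flag and an arrival counter; the class name and the two-level priority scheme are assumptions for illustration, and an actual scheduler 450 may rank shreds differently.

```python
import heapq
import itertools


class HintedReadyQueue:
    """Ready queue that dispatches hinted critical shreds first."""

    def __init__(self, critical_shreds):
        self._critical = set(critical_shreds)   # hint from the hints generator
        self._heap = []
        self._arrival = itertools.count()       # preserves FIFO within a level

    def push(self, shred_id):
        priority = 0 if shred_id in self._critical else 1
        heapq.heappush(self._heap, (priority, next(self._arrival), shred_id))

    def pop(self):
        # Lowest (priority, arrival) tuple wins: critical first, then oldest.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```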
Data Flow Scheduling. In contrast to system critical path scheduling, which seeks to improve performance by reducing the latency of the critical path of the system, data flow scheduling seeks to reduce latency for an individual shred. In this approach, the scheduler 450 may seek to schedule to the same sequencer those shreds that share data. One goal of such technique is to improve data locality and therefore to decrease the overall number of cache misses, thereby decreasing execution time for a shred.
As is explained above, the TSDG (see 800,
Dynamic Power Throttling. Rather than attempting to improve performance, the third optimization approach attempts to reduce energy usage by dynamically controlling a power throttle. This approach may be utilized for an asymmetric multiprocessing system that includes one or more sequencers for which power usage may be down-throttled. When down-throttled, the sequencers may utilize less power, be more energy-efficient, and may have a slower execution time.
As has been stated above, the system critical path can be easily determined from the TSDG and therefore, conversely, the TSDG also identifies the shreds that are not performance-critical. The hints generator 506 may thus pass hints 610 that identify non-critical shreds to the scheduler 450. The scheduler 450 may schedule such non-critical shreds on down-throttled sequencers. For an asymmetric multiprocessing system, the scheduler 450 may control the throttling mechanism and may, therefore, essentially control the behavior of the system. Thus, by using system critical path information provided by the TSDG, hints can be generated and provided to a scheduler, which can reduce overall energy usage by dynamically throttling the asymmetric multiprocessing system.
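The placement policy described here can be illustrated with a short sketch that sends critical shreds to full-power sequencers and all others to down-throttled ones. The function name, the sequencer identifiers, and the round-robin choice within each pool are hypothetical; the sketch also assumes both pools are non-empty.

```python
from itertools import cycle


def assign_by_criticality(shreds, critical, full_power, throttled):
    """Map each shred to a sequencer based on a criticality hint.

    Assumes `full_power` and `throttled` are non-empty lists of sequencer ids.
    """
    full = cycle(full_power)   # round-robin over full-power sequencers
    low = cycle(throttled)     # round-robin over down-throttled sequencers
    placement = {}
    for shred in shreds:
        placement[shred] = next(full) if shred in critical else next(low)
    return placement
```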
As an alternative embodiment, an asymmetric multiprocessing system may include sequencers of varying fixed power consumption requirements. That is, one or more sequencers may, rather than having power dynamically throttled, be statically configured at a lower power consumption requirement than one or more other sequencers in the system. For such embodiment, non-performance-critical shreds may be scheduled on the lower-power sequencer(s).
Continuing to consult
For the former approach (online analysis), only a partial TSDG 604 is generated by the scheduling hints generator 506. Using a partial TSDG 604 that has been generated for a window of execution for the shredded program 602, the scheduling hints generator 506 predicts scheduling priority for shreds as the program 602 continues to run. The hints can be used as a predictor for future execution behavior. The output of the scheduler is a new schedule based on these hints or predictions, with the goal to improve performance.
For the latter approach (offline analysis), a full TSDG 604 may be generated during a first pass through the shredded program 602. Scheduling hints 610 generated by the scheduling hints generator 506, based on the full TSDG 604, may then be forwarded to the scheduler 450 and utilized during a subsequent execution pass of the shredded program 602.
At least one embodiment combines the online and offline analysis approaches for a hybrid approach. For the hybrid approach, offline analysis results in scheduling hints harvested from a prior run and profile; such hints are passed to the scheduler 450. With the offline scheduling hints as input, the scheduler 450 may also dynamically refine, adjust, adapt and update the hints based on dynamic shred scheduling behaviors as observed via online analysis.
At block 954, the execution history file “text” may be sorted and an alphabet 970 of unique “symbols” may be generated. Each symbol in the alphabet 970 may be used to represent a unique shred instance. The alphabet 970 may be ranked according to frequency of occurrence for each symbol. In addition, the execution history, based on shred identifiers, recorded at block 952 may be translated into a symbol-based execution history at block 954.
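The alphabet construction and translation at block 954 might look like the following. This is a sketch under stated assumptions: shred instances are represented by integer identifiers, symbols are integer ranks (0 for the most frequent shred), and ties in frequency are broken by order of first appearance. None of these details are specified in the description above.

```python
from collections import Counter


def build_alphabet(history):
    """Rank shred ids by frequency and translate the history into symbols.

    Returns (alphabet, symbol_history) where alphabet maps shred id -> symbol.
    """
    counts = Counter(history)
    first_seen = {}
    for position, shred_id in enumerate(history):
        first_seen.setdefault(shred_id, position)
    # Most frequent first; ties broken by earliest first appearance.
    ranked = sorted(counts, key=lambda s: (-counts[s], first_seen[s]))
    alphabet = {shred_id: rank for rank, shred_id in enumerate(ranked)}
    return alphabet, [alphabet[s] for s in history]
```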
As a further example to illustrate the processing of the method 950, assume that a sequence of shred instances is recorded in the execution history at block 952 for a scheduling loop, and translated to symbols at block 954. A sample sequence is set forth in Table 1:
The sample sequence shown in Table 1 indicates that several patterns of recurrent sequences of adjacent symbols may be identified in the symbol-based execution history generated at block 954. For example, Table 1 illustrates that an instance of shred A is always followed by shred B. Thus, AB may be identified as a “phrase.” Such recurrent phrase may be recorded at block 956 in a phrase dictionary 980. Based upon this dictionary 980, a hint may be generated at block 958 to let the scheduler know that shred B is often scheduled after shred A. Upon further examination, one can see that the pattern “A, B, C, D” is an even bigger phrase evident in Table 1. Accordingly, the phrase “A, B, C, D” may be recorded in the phrase dictionary 980 at block 956, and a hint about this phrase may be generated at block 958.
The phrases recorded in the phrase dictionary 980 may be identified, for at least one embodiment, by running a compression algorithm at block 956 against the symbol-based execution history that has been generated at block 954. For at least one embodiment, the compression algorithm is a Lempel-Ziv-equivalent compression method for which the alphabet is extended from 8-bit ASCII to a new alphabet represented by the 32-bit or 64-bit symbols in the symbol alphabet 970 that was generated at block 954.
For at least one embodiment, the compression algorithm used at block 956 is proven information-theoretically optimal and efficient (with time linear to the size of the input text and the lookup time close to constant). The result of compression as applied at block 956 may be the phrase dictionary 980, which enumerates the frequently-recurring phrases of symbols that appear in the symbol-based execution history that was generated at block 954. For such embodiment, each phrase in the phrase dictionary 980 represents a recurrent chain of shred scheduling activities involving a particular set of shreds, which may be interacting through a particular set of synchronization objects and/or control primitives in a particular order. The frequency (that is, the amount of redundancy) of each of these recurrent chains may be used to rank the phrases in the phrase dictionary 980.
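To make the dictionary-building step concrete, the sketch below applies LZ78-style incremental parsing to a symbol sequence. LZ78 is chosen here only as a familiar member of the Lempel-Ziv family; the description does not state which variant is used, and the function names and hit-count ranking are assumptions. The input in the usage note is a made-up sequence and is not the content of Table 1.

```python
def lz78_phrases(symbols):
    """Parse `symbols` LZ78-style and count how often each phrase is seen.

    Returns a dict mapping phrase (tuple of symbols) -> hit count, where a
    hit is either the phrase's creation or a later match during parsing.
    """
    dictionary = {}
    phrase = ()
    for sym in symbols:
        candidate = phrase + (sym,)
        if candidate in dictionary:
            dictionary[candidate] += 1   # known phrase: keep extending it
            phrase = candidate
        else:
            dictionary[candidate] = 1    # new phrase: record and restart
            phrase = ()
    return dictionary


def ranked_phrases(dictionary):
    """Multi-symbol phrases, most frequently hit first."""
    multi = [p for p in dictionary if len(p) > 1]
    return sorted(multi, key=lambda p: (-dictionary[p], p))
```

Applied to the hypothetical sequence `list("ABCDABCDABCD")`, the parse records the phrase `("A", "B")` with the highest hit count among multi-symbol phrases, and longer phrases such as `("A", "B", "C")` appear as the sequence continues to repeat. Each dictionary lookup is a hash probe, which is consistent with the near-constant lookup time noted above.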
To briefly delve a bit deeper into data flow shred scheduling concepts supported by embodiments of the scheduler disclosed herein, one should note that, for at least one embodiment, each processor in a multi-core system includes a cache. It should also be noted that shreds for the same thread may share the same application working set. For example, if shred B depends on shred A, there could be a synchronization point (mutex, etc.) around data that is shared by both shreds. Also, or in the alternative, shreds A and B might touch the same data structure. Generally, if shred B depends on shred A, the scheduler may assume that the shreds share at least some data.
Accordingly, the hints generator may generate a hint, at block 958, to indicate that shreds A and B should be scheduled on the same core, if possible, so that they can share a data cache. In sum, the hints generator may generate a “locality” hint based on linear dependency so that the consumer may be scheduled to execute close to, or on the same sequencer as, the producer shred. In this manner, the scheduler may effectively move code in order to accommodate data dependencies. Generally stated, the scheduler may attempt to schedule linearly dependent shreds to execute, serially, on the same (or a nearby) sequencer in order to take advantage of data locality at the cache level. This approach is based on the assumption that linearly dependent shreds are likely to use the same data. In other words, the scheduler logic 450 may schedule shreds for execution close to where the working set resides.
Alternatively, the scheduler may utilize a locality hint in order to migrate a working set of data from one cache to another. That is, the scheduler may cause data to be moved to the core that will execute the code that needs the data. Such approach may be utilized for systems in which the sequencer hardware supports data migration. In other words, the scheduler 450 may schedule data movement towards where the code that uses the data resides.
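A locality hint of this kind could drive placement roughly as follows. The sketch is a simplification under assumed inputs: `producer_of` is a hypothetical mapping from each consumer shred to the shred it depends on, sequencers are numbered from zero, and shreds without a placed producer are spread round-robin.

```python
def place_with_locality(shreds, producer_of, num_sequencers):
    """Assign sequencers so a consumer runs where its producer ran.

    shreds:      shred ids in scheduling order
    producer_of: dict mapping consumer shred -> producer shred (locality hint)
    """
    placement = {}
    next_seq = 0
    for shred in shreds:
        producer = producer_of.get(shred)
        if producer in placement:
            # Follow the hint: reuse the producer's sequencer and its cache.
            placement[shred] = placement[producer]
        else:
            # No usable hint: fall back to round-robin placement.
            placement[shred] = next_seq % num_sequencers
            next_seq += 1
    return placement
```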
The scheduler may also take advantage of locality hints to implement a type of shred-level parallelism. If the scheduler receives a hint that shreds A, B, C, and D are linearly dependent and are often executed sequentially as a “phrase”, the scheduler can map the shreds onto adjacent sequencers. In addition, the data from each of the sequencers can be migrated along the chain of sequencers so that data is migrated through the dependence chain, even though the code for each shred executes on a separate sequencer.
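The mapping of a phrase onto adjacent sequencers may be sketched as follows. The sketch assumes that sequencers carry integer identifiers and that consecutive identifiers denote adjacent sequencers; both are illustrative assumptions, as are the names `map_phrase_to_adjacent` and `data_hops`.

```python
def map_phrase_to_adjacent(phrase, free_sequencers):
    """Map each shred of a phrase to one sequencer in a run of adjacent ones.

    Returns {shred: sequencer_id} using the first run of consecutive free
    identifiers that is long enough, or None if no such run exists.
    """
    if not phrase:
        return {}
    free = sorted(set(free_sequencers))
    need = len(phrase)
    for i in range(len(free) - need + 1):
        window = free[i:i + need]
        if window[-1] - window[0] == need - 1:   # identifiers are consecutive
            return dict(zip(phrase, window))
    return None

def data_hops(phrase, mapping):
    """List the (source, destination) sequencer pairs along the dependence chain."""
    ids = [mapping[shred] for shred in phrase]
    return list(zip(ids, ids[1:]))

phrase = ["A", "B", "C", "D"]
mapping = map_phrase_to_adjacent(phrase, free_sequencers=[0, 2, 3, 4, 5, 7])
hops = data_hops(phrase, mapping)
```

Here the phrase is mapped to sequencers 2 through 5, and `hops` lists the sequencer-to-sequencer transfers that the working set would follow as it moves from shred A to shred D.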
This approach, which may be conceptually viewed as a type of pipelining, is illustrated in
Returning to
The hints generated at block 958 may also include phrase-level optimizations. For example, runtime software may be aware of hardware resource allocation at any particular point in time (as opposed, for example, to scheduling optimizations performed by a compiler). Accordingly, the scheduling hints generator (see, e.g., 506 of
The hints generated at block 958 may also include transformation hints. For at least one embodiment, for example, a transformation hint may be utilized by the scheduler in order to perform load balancing. If the load instruction activity for each shred of a sequential phrase is unequal, but available sequencers on which to execute the shreds are of the same size, then the code for the shreds may be transformed in order to more equally distribute load instructions among the sequencers.
Further discussion of load balancing is made with reference to
Embodiments of the runtime library discussed herein support user-level shreds for any type of multi-sequencer system. Any user-level runtime software that supports user-level threads, including fibers, pthreads and the like, may utilize the techniques described herein. In addition, the scheduling mechanism and techniques discussed herein may be implemented on any multi-sequencer system, including a single-core SMT system (see, e.g., 310 of
For at least one embodiment, user-level shreds from the same application may run on all, or any subset, of OS-visible sequencers and/or OS-sequestered sequencers concurrently. Instead of merely sustaining a one-to-one mapping of application threads to OS threads and relying on the OS to manage the mapping between sequencers and threads, embodiments of the runtime library discussed herein may allow multiple user-level shreds in a single application image to run concurrently in a multi-sequencer system. For a single application program that is both multi-threaded and multi-shredded, embodiments of the present invention may thus support M:N thread-to-shred mapping (M, N≧1), so that M threads and N user-level shreds may execute concurrently on any or all sequencers in the system, whether OS-visible or OS-sequestered.
Such a runtime library as disclosed herein provides a contrast, for example, to systems which allow, at most, only one user-controlled “fiber” to execute per OS-visible thread. A fiber for such systems is associated with an OS-controlled thread, and two fibers from the same thread cannot be executed concurrently. For such contrasted systems, multiple user-level shreds from the same OS-controlled thread cannot execute concurrently.
For at least one embodiment of a runtime library as disclosed herein, the library (see, e.g., 500 of
Memory system 940 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory and related circuitry. Memory system 940 may store instructions 910 and/or data 912 represented by data signals that may be executed by processor 904. The instructions 910 and/or data 912 may include code and/or data for performing any or all of the techniques discussed herein. For example, the data 912 may include one or more queues to form a queue system 402 capable of storing shred descriptors as described above. Alternatively, the instructions 910 may include instructions to generate a queue system 402 for storing shred descriptors and may include scheduling logic 450.
The processor 904 may include a front end 920 that supplies instruction information to an execution core 930. Fetched instruction information may be buffered in a cache 225 to await execution by the execution core 930. The front end 920 may supply the instruction information to the execution core 930 in program order. For at least one embodiment, the front end 920 includes a fetch/decode unit 322 that determines the next instruction to be executed. For at least one embodiment of the system 900, the fetch/decode unit 322 may include a single next-instruction-pointer and fetch logic 320. However, in an embodiment where each processor 904 supports multiple thread contexts, the fetch/decode unit 322 implements distinct next-instruction-pointer and fetch logic 320 for each supported thread context. The optional nature of additional next-instruction-pointer and fetch logic 320 in a multiprocessor environment is denoted by dotted lines in
Embodiments of the methods described herein may be implemented in hardware, hardware emulation software or other software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented for a programmable system comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
A program may be stored on a storage medium or device (e.g., hard disk drive, floppy disk drive, read only memory (ROM), CD-ROM device, flash memory device, digital versatile disk (DVD), or other storage device) readable by a general or special purpose programmable processing system. The instructions, accessible to a processor in a processing system, provide for configuring and operating the processing system when the storage medium or device is read by the processing system to perform the procedures described herein. Embodiments of the invention may also be considered to be implemented as a machine-readable storage medium, configured for use with a processing system, where the storage medium so configured causes the processing system to operate in a specific and predefined manner to perform the functions described herein.
Sample system 900 is representative of processing systems based on the Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, Itanium®, and Itanium® 2 microprocessors and the Mobile Intel® Pentium® III Processor-M and Mobile Intel® Pentium® 4 Processor-M available from Intel Corporation, although other systems (including personal computers (PCs) having other microprocessors, engineering workstations, personal digital assistants and other hand-held devices, set-top boxes and the like) may also be used. For one embodiment, sample system 900 may execute a version of the Windows™ operating system available from Microsoft Corporation, although other operating systems and graphical user interfaces, for example, may also be used.
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications can be made without departing from the scope of the appended claims. For example, the work queue system 702 may include a single queue for which multiple sequencer types contend. For such an embodiment, resource requirements are expressly included in each shred descriptor. Each sequencer's portion of the distributed scheduler checks that the sequencer is capable of executing a shred before the shred's descriptor is removed from the work queue for execution by the sequencer.
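The single-queue alternative may be illustrated with a minimal sketch in which each shred descriptor carries its resource requirements and a sequencer removes a descriptor only after confirming that it can execute the shred. The class names, the representation of requirements as sets of strings, and the oldest-first scan are assumptions made for illustration. The sketch is single-threaded and omits the synchronization that a queue contended for by multiple sequencers would require.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class ShredDescriptor:
    name: str
    requires: frozenset   # resource requirements carried in the descriptor

class SharedWorkQueue:
    """One queue shared by sequencers of different types (no locking shown)."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, descriptor):
        self._queue.append(descriptor)

    def dequeue_for(self, capabilities):
        """Remove and return the oldest descriptor whose requirements are
        met by the given capabilities, or None if there is none."""
        for i, descriptor in enumerate(self._queue):
            if descriptor.requires <= capabilities:   # capability check first
                del self._queue[i]                    # then remove from queue
                return descriptor
        return None

queue = SharedWorkQueue()
queue.enqueue(ShredDescriptor("s1", frozenset({"simd"})))
queue.enqueue(ShredDescriptor("s2", frozenset()))

plain = queue.dequeue_for(frozenset())           # sequencer without SIMD
wide = queue.dequeue_for(frozenset({"simd"}))    # sequencer with SIMD
empty = queue.dequeue_for(frozenset({"simd"}))   # nothing left to run
```

In this example, the sequencer without the hypothetical "simd" capability skips descriptor s1 and takes s2, leaving s1 in the queue for a sequencer that can execute it.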
Accordingly, one of skill in the art will recognize that changes and modifications can be made without departing from the present invention in its broader aspects. The appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.