The systems and methods for scheduling thread execution among a plurality of processors based on evaluation of memory access data in accordance with the present invention are further described with reference to the accompanying drawings.
Certain specific details are set forth in the following description and figures to provide a thorough understanding of various embodiments of the invention. Certain well-known details often associated with computing and software technology are not set forth in the following disclosure, however, to avoid unnecessarily obscuring the various embodiments of the invention. Further, those of ordinary skill in the relevant art will understand that they can practice other embodiments of the invention without one or more of the details described below. Finally, while various methods are described with reference to steps and sequences in the following disclosure, the description as such is for providing a clear implementation of embodiments of the invention, and the steps and sequences of steps should not be taken as required to practice this invention.
When scheduling execution of threads on multicore computer chips, it is very important to have good information about the threads' locality of accesses in the instruction and data caches. This is because some threads are related, making it impractical to assign them to different processors, while other threads can be more or less compatible, resulting in more or less advantage to assigning them to different processing cores. Current processors have only limited and model-specific hardware performance counters. These count low-level processor-internal hardware events, e.g., branch mispredictions and cache-line fills. Some processors allow the operating system to receive an interrupt when these counters reach a particular value. Operating systems for multicore machines benefit from a more complete set of performance counters, as provided herein, which allow the operating system to cheaply determine the cache and memory-system footprints of threads, so that threads can be assigned to cores in a more principled fashion.
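For a concrete sense of the counters available today, the following is a minimal sketch, assuming a Linux system and its perf_event_open(2) interface rather than the extensions proposed herein, that reads a per-thread hardware cache-miss counter of the kind a scheduler could use to estimate a thread's memory-system footprint:

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    perf_event_attr attr{};                    // zero-initialized
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  // typically last-level cache misses
    attr.disabled = 1;                         // start stopped; enable explicitly
    attr.exclude_kernel = 1;                   // count user-mode activity only

    // pid = 0, cpu = -1: count events for the calling thread on any CPU.
    int fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sink = 0;                    // stand-in for the measured workload
    for (long i = 0; i < 1000000; ++i) sink += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) misses = 0;
    std::printf("cache misses: %llu\n",
                static_cast<unsigned long long>(misses));
    close(fd);
    return 0;
}
```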
It will be appreciated that a multicore computer chip 200 such as that of FIG. 2 may comprise a variety of components in addition to its processors.
Components of chip 200 may be grouped into functional groups. For example, router 282, shared memory 203, a scheduler running on processor 269, cache 230, main CPU 210, crypto processor 240, watchdog processor 250, and key storage 295 may be components of a first functional group. Such a group might generally operate in tighter cooperation with other components in the group than with components outside the group. A functional group may have, for example, caches that are accessible only to the components of the group.
A multicore computer chip such as 320 may have multiple processors 331-334, each with various levels of available cache. For example, each processor 331-334 may have a private level one cache 341-344, and a level two cache 351 or 352 that is available to a subgroup of processors, e.g., 331-332 or 333-334, respectively. Any number of further cache levels may also be accessible to processors 331-334, e.g., a level three cache 361, which is illustrated as being accessible to all of processors 331-334. The interoperation of processors 331-334 and the various ways in which caches 341-344, 351-352, and 361 are accessed may be controlled by logic in the processors 331-334 themselves, e.g., by one or more modules in a processor's instruction set. This may also be controlled by OS 310 and applications 301-303.
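As an illustration only, the following minimal sketch (with hypothetical structure and function names not drawn from this specification) models the cache sharing among processors 331-334 described above, of the kind scheduling logic might consult:

```cpp
#include <vector>

// One cache in the hierarchy of FIG. 3 (hypothetical structure for illustration).
struct CacheLevel {
    int id;                    // reference numeral, e.g., 341, 351, 361
    int level;                 // 1, 2, or 3
    std::vector<int> sharers;  // processors with access to this cache
};

// Private L1 caches 341-344, L2 caches 351-352 each shared by a pair of
// processors, and L3 cache 361 shared by all of processors 331-334.
const std::vector<CacheLevel> topology = {
    {341, 1, {331}}, {342, 1, {332}}, {343, 1, {333}}, {344, 1, {334}},
    {351, 2, {331, 332}}, {352, 2, {333, 334}},
    {361, 3, {331, 332, 333, 334}},
};

// True if the two processors share some cache at a level up to and including
// max_level; a scheduler may prefer such processors for related threads.
bool share_cache(int cpu_a, int cpu_b, int max_level) {
    for (const CacheLevel& c : topology) {
        if (c.level > max_level) continue;
        bool has_a = false, has_b = false;
        for (int p : c.sharers) {
            has_a = has_a || (p == cpu_a);
            has_b = has_b || (p == cpu_b);
        }
        if (has_a && has_b) return true;
    }
    return false;
}
```

Here share_cache(331, 332, 2) is true by virtue of level two cache 351, while processors 331 and 333 share a cache only at level three (cache 361).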
An API 401 is a mechanism that allows separate computer processes to work together. In the familiar setting of a personal computer running an operating system and various applications such as MICROSOFT WORD® and ADOBE ACROBAT READER®, an API allows the applications 411-413 to communicate with the operating system 400. An application 411 makes calls to the operating system API 401 to invoke operating system 400 services. The actual code behind the operating system API 401 is typically located in a collection of dynamic link libraries (“DLLs”).
An API 401 can be implemented in the form of computer executable instructions. These instructions can be embodied in many different forms. Eventually, instructions are reduced to machine-readable bits for processing by a computer processor 471. Prior to the generation of these machine-readable bits, however, there may be many layers of functionality that convert an API 401 implementation into various forms. For example, an API that is implemented in C++ will first appear as a series of human-readable lines of code. The API will then be compiled by compiler software into machine-readable code for execution on a processor.
Recently, the proliferation of programming languages, such as C++, and the proliferation of execution environments, such as the PC environment, the environment provided by APPLE® computers, handheld computerized devices, cell phones, and so on, have brought about the need for additional layers of functionality between the original implementation of programming code, such as an API implementation, and the reduction to bits for processing on a device. Today, a computer program initially created in a high-level language such as C++ will first be converted into an intermediate language such as MICROSOFT® Intermediate Language (MSIL) or JAVA® bytecode. The intermediate language may then be compiled by a Just-in-Time (JIT) compiler immediately prior to execution in a particular environment. This allows code to be run in a wide variety of processing environments without the need to distribute multiple compiled versions. In light of the many levels at which an API 401 can be implemented, and the continuously evolving techniques for creating, managing, and processing code, the invention is not limited to any particular programming language or execution environment. The implementation chosen for description of various aspects of the invention is in no way intended to limit the invention to this implementation.
The scheduler 402 can be a process associated with the operating system 400. The scheduler 402 manages execution of applications 411-413 by assigning operations among the different processors 471, 481, 485, 491. The scheduler 402 therefore manages the resources used by application processes and threads. A brief general description of processes and threads will serve to point out the resources that are managed in this regard.
An instance of an application is known as a process. Every process has at least one thread, the main thread, but can have many. Each thread represents an independent execution mechanism; any code that runs within an application runs via a thread. In a typical arrangement, each process is allotted its own virtual memory address space by an operating system, and all threads within the process share this virtual memory space. Multiple threads that modify the same resource must synchronize access to the resource in order to prevent erratic behavior and possible access violations. To allow a thread to maintain a context that is independent of other threads, each thread in a process gets its own set of volatile registers. A volatile register is the software equivalent of a CPU register: it is used to save and restore the hardware registers, and the volatile registers are copied to and from the CPU registers every time the thread is scheduled or unscheduled to run by a typical operating system.
In addition to the set of volatile registers that represent a processor state, typical threads also maintain a stack for executing in kernel mode, a stack for executing in user mode, a thread local storage (“TLS”) area, a unique identifier known as a thread ID, and, optionally, a security context. The TLS area, registers, and thread stacks are collectively known as a thread's context. Data about the thread's context must be stored so as to be accessible to the processor executing the thread, so that the processor can schedule and execute operations for the thread.
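By way of illustration, the following is a minimal sketch (assuming the Win32 CreateThread API) of a process starting a second thread; the new thread receives its own register context, stacks, TLS area, and thread ID as described above:

```cpp
#include <windows.h>
#include <cstdio>

// Entry point for the new thread; it runs with its own context.
DWORD WINAPI worker(LPVOID param) {
    (void)param;  // unused
    std::printf("worker thread id: %lu\n", GetCurrentThreadId());
    return 0;
}

int main() {
    DWORD tid = 0;
    // Default security, default stack size, no creation flags.
    HANDLE h = CreateThread(nullptr, 0, worker, nullptr, 0, &tid);
    if (h == nullptr) return 1;
    WaitForSingleObject(h, INFINITE);  // wait for the worker to exit
    CloseHandle(h);
    std::printf("main thread id: %lu\n", GetCurrentThreadId());
    return 0;
}
```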
In light of these resources that must be maintained by a computer for running threads, it will be acknowledged that threads are not “free”: they consume a significant amount of system resources, and it is desirable to minimize the number of additional threads running on a single processor such as 471 by outsourcing them, where possible, to other processors such as 481, 485, and 491. More specifically, and with reference to the above discussion of threads, each thread consumes a portion of system memory 451 that cannot be moved to a new location and is therefore a resource-intensive use of memory 451. Operations for each running thread must be scheduled for execution either serially or on a priority basis, and time spent scheduling operations, rather than performing them, consumes processor resources. There is also non-trivial overhead associated with switching between threads. This “context-switch overhead” is dominated by the cost of flushing the old thread's data from the cache(s) and the large number of cache misses incurred by the new thread. Finally, each thread is allotted an amount of processor time based on the number of running threads, so more running threads reduce the amount of processor time available per thread.
Scheduler 402 or an associated operating system 400 module can select a processor, e.g., 471 from said plurality of processors 471, 481, 485, 491 to execute a thread. The processor selection may be made based on which processor 471, 481, 485, or 491 can best handle the thread in question. Thus, scheduler 402 can select a processor 471, 481, 485, or 491 after consulting information comprising an identity of threads that may be simultaneously executing on said plurality of processors 471, 481, 485, 491. Such selection can be accomplished just as in multi-processor aware operating systems available today that provide an API for restricting the set of processors on which a thread is allowed to execute. This is commonly known as thread affinity.
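As a minimal sketch of such a thread affinity API, assuming the Win32 SetThreadAffinityMask call (POSIX systems offer comparable facilities, e.g., pthread_setaffinity_np), the following restricts the calling thread to a single processor:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // Bit 0 of the mask set: the current thread may run only on processor 0.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0) {
        std::printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    // Work performed here executes only on processor 0; a scheduler such as
    // 402 could use the same facility to implement its placement decisions.
    return 0;
}
```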
For example, consider a scenario in which 10 threads are simultaneously executing on processors 471, 481, 485, and 491. Threads 1, 2, and 3 are executing on processor 471. Threads 4, 5, and 6 are executing on processor 481. Threads 7, 8, and 9 are executing on processor 485. Thread 10 is executing on processor 491. “Simultaneously executing” should be understood to mean the thread is presently associated with a processor such that thread instructions either are or will soon be executing on the processor. The thread is part of the processor's current workload, but it is possible that the thread's instructions are not currently executing because some other thread is currently executing.
Now, for example, a new thread, thread 11, is started by the operating system 400. The scheduler 402 must assign thread 11 to a processor. In accordance with an embodiment of the invention, the scheduler consults the identity of threads executing on processors 471, 481, 485, 491 prior to determining which processor thread 11 will be assigned to. Thread identity can be, for example, a thread ID or other information that identifies the thread. Thread identity may uniquely identify the thread or may identify a class of threads of which the thread is a member. Thread identity is therefore any information that distinguishes a thread from at least one other thread.
Thread identity is consulted because scheduler 402 may have information regarding thread compatibility. For example, the scheduler may select a single processor 471 from a plurality of processors 471, 481, 485, and 491 for execution of two or more related threads. The scheduler 402 may select two or more separate processors 471 and 481 from the plurality of processors 471, 481, 485, and 491 for execution of incompatible threads.
Information as to whether threads are related or incompatible, or as to a degree of compatibility of threads, may be gathered, for example, by hardware extensions 473, 483, 487, and 493, which collect and store memory access data 452 in memory 451. For example, when two threads are executing on a processor 471, hardware extension 473 can measure information such as frequency of cache access, the number of memory locations a thread is accessing, the size of the working set, cache hits, and cache misses. This information can be stored in memory 451 as memory access data 452. While hardware extensions 473, 483, 487, and 493 are illustrated in an on-chip or processor-integrated configuration, this is not required, and 473, 483, 487, and 493 may just as well be memory units located off-chip, as in an implementation in which this function is performed by a computer's main memory.
Memory access data 452 may be evaluated by evaluation module 403. Evaluation module 403 can evaluate memory access data 452 to determine whether two or more threads are prospectively compatible for simultaneous execution on a single processor 471 or incompatible for such execution, or to determine a degree of compatibility for simultaneous execution on a single processor 471. In order to gather the memory access data, it may be that the two or more threads were executed by a single processor 471; if such a processor assignment resulted in low performance, however, those threads can be assigned to different processors prospectively. Thread compatibility information 453 can be stored by evaluation module 403 and consulted when starting a new thread or when migrating an existing thread to a new processor.
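As a minimal sketch of the kind of evaluation that module 403 might perform (the field names, weights, and thresholds below are hypothetical, not taken from this specification), the following scores two threads' compatibility from collected memory access data:

```cpp
#include <cstdint>
#include <cstdio>

// Per-thread measurements of the kind stored as memory access data 452.
struct MemoryAccessData {
    std::uint64_t cache_hits;
    std::uint64_t cache_misses;
    std::uint64_t working_set_bytes;
};

// Returns a score in [0, 1]; higher means more compatible. Two heuristics:
// penalize combined working sets that overflow the shared cache, and
// penalize low hit rates observed while the threads were co-scheduled.
double compatibility_score(const MemoryAccessData& a, const MemoryAccessData& b,
                           std::uint64_t shared_cache_bytes) {
    double fit = static_cast<double>(shared_cache_bytes) /
                 static_cast<double>(a.working_set_bytes + b.working_set_bytes + 1);
    if (fit > 1.0) fit = 1.0;

    std::uint64_t accesses = a.cache_hits + a.cache_misses +
                             b.cache_hits + b.cache_misses + 1;
    double hit_rate = static_cast<double>(a.cache_hits + b.cache_hits) /
                      static_cast<double>(accesses);

    return 0.5 * fit + 0.5 * hit_rate;  // hypothetical equal weighting
}

int main() {
    MemoryAccessData a{9000, 1000, 1 << 20};  // ~1 MiB working set
    MemoryAccessData b{5000, 5000, 8 << 20};  // ~8 MiB working set
    std::printf("score: %.2f\n", compatibility_score(a, b, 4u << 20));
    return 0;
}
```

A score near 1 suggests the threads may share processor 471; a low score suggests assigning them to separate processors such as 471 and 481.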
Thread compatibility information 453 may also be used by scheduler 402 to adjust a thread scheduling frequency. Some threads benefit from longer uninterrupted execution times, while other threads can be context-switched more frequently. Evaluation module 403 may determine an optimum scheduling frequency for threads in situations in which multiple threads must be assigned to a same processor.
Another aspect of the invention, which may also be appreciated from FIG. 4, is that thread compatibility information 453 can persist beyond the threads that produced it, so that scheduler 402 can consult it when the same or similar threads are encountered again.
As should be clear from the above, memory access data referenced in FIG. 4 comprises measurements such as cache hits, cache misses, working set size, and frequency of cache access, collected by hardware extensions 473, 483, 487, and 493 while threads execute.
Starting with step 501, in one contemplated embodiment of the invention, an application may call an operating system API to start a first thread 501. The operating system may start the desired thread on a first processor 503. Next, an application, which may be the same or a different application, calls the operating system API to start a second thread 502. Assuming no pre-existing information about thread compatibility, the operating system may start the second thread on the first processor as well 504.
A hardware extension associated with the first processor may now collect memory access data to determine the compatibility of the two threads 505. In the case of related threads, for example, threads associated with a single application that frequently share and update data, the operating system or some evaluation module may evaluate memory access data to determine an optimum scheduling frequency 506. An optimum scheduling frequency may be associated with some thread identification information. When the related threads are subsequently running on a processor, the operating system may adjust the scheduling frequency for optimum performance 507.
In the case of unrelated threads, the operating system or some evaluation module may evaluate memory access data to determine compatibility of the threads 508. Information regarding compatibility, which may include a degree of compatibility and/or an optimum scheduling frequency to be used when the threads are to be executed by a same processor, may be associated with thread identification information. The threads may subsequently be assigned to separate processors as necessary 509. If the threads are very compatible, they may subsequently be placed on a same processor, at an optimum scheduling frequency. If they are marginally compatible or considered incompatible, they may be assigned to different processors if possible.
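A minimal sketch of the placement decision at steps 508-509 follows (the threshold and names are hypothetical): very compatible threads may be co-located, while marginal or incompatible threads are separated when another processor is available:

```cpp
#include <cstdio>

enum class Placement { SameProcessor, DifferentProcessors };

// Decide placement from a stored compatibility score in [0, 1].
Placement place(double compatibility_score, bool other_processor_available) {
    const double kVeryCompatible = 0.8;   // hypothetical cutoff
    if (compatibility_score >= kVeryCompatible)
        return Placement::SameProcessor;  // co-locate, with tuned frequency
    return other_processor_available ? Placement::DifferentProcessors
                                     : Placement::SameProcessor;
}

int main() {
    std::printf("%d\n", static_cast<int>(place(0.9, true)));  // 0: same processor
    std::printf("%d\n", static_cast<int>(place(0.3, true)));  // 1: different processors
    return 0;
}
```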
An application may be pre-tested for thread compatibility with other application threads, for example, by the application programmer, distributor, or a third-party testing service. The information may be provided to an end-user computing device, where scheduler and evaluation components such as those described above can consult the pre-tested compatibility information rather than, or in addition to, gathering their own measurements.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, cell phones, Personal Digital Assistants (PDA), distributed computing environments that include any of the above systems or devices, and the like.
In addition to the specific implementations explicitly set forth herein, other aspects and implementations will be apparent to those skilled in the art from consideration of the specification disclosed herein. It is intended that the specification and illustrated implementations be considered as examples only, with a true scope and spirit being indicated by the following claims.