Symmetric multiprocessing (SMP) is a computer architecture that provides fast performance by making multiple CPUs available to complete individual processes simultaneously (multiprocessing). Any idle processor can be assigned any task, and additional CPUs can be added to improve performance and handle increased loads. A chip multiprocessor (CMP) includes multiple processor cores on a single chip, which allows more than one thread to be active at a time on the chip. A CMP is SMP implemented on a single integrated circuit. Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. A goal of SMP/CMP is to allow greater utilization of TLP.
Parallel programming languages (e.g., OpenMP, TBB, CILK) are used for writing multithreaded applications. A multithreaded application may assign different tasks to different CPUs 110. Data that is used by multiple CPUs 110 is shared via the shared memory 150. Utilizing the shared memory 150 may result in long-latency memory accesses and high bus bandwidth consumption. It is the responsibility of the hardware to guarantee consistent memory access, which may introduce performance overhead.
The features and advantages of the various embodiments will become apparent from the following detailed description in which:
Referring back to
To exploit data locality in clustered SMP (e.g., 100), an application has to pass data sharing information to the architecture. The data sharing information may be passed either through a programmer's annotation by language extensions (annotations), or through compiler analysis on data flow. A parallel programming language (e.g., OpenMP, TBB, CILK) may be used to generate a multithreaded application.
The tasks are queued according to the assigned cluster and scheduling of the tasks is based on assigned clusters. The scheduling may be done centrally for the entire SMP or may be done in a distributed fashion.
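As one illustration, the per-cluster queuing described above may be sketched in C as a set of fixed-capacity queues, one per cluster. The type, capacity, and function names below are illustrative assumptions rather than structures from the disclosure, and synchronization is omitted for clarity:

```c
#include <assert.h>

#define MAX_TASKS 64  /* illustrative capacity */

/* One ready-task queue per cluster. In a distributed scheme each
 * cluster's CPUs enqueue and dequeue from their own queue, which
 * could reside in that cluster's shared cache. Locking is omitted. */
struct cluster_queue {
    int tasks[MAX_TASKS];
    int head, tail;
};

/* Queue a task according to its assigned cluster. */
static void cq_push(struct cluster_queue *q, int task_id) {
    q->tasks[q->tail++] = task_id;
}

/* Dequeue the next task for this cluster, or -1 if the queue is empty
 * (a CPU might then look to other clusters for load balancing). */
static int cq_pop(struct cluster_queue *q) {
    return (q->head == q->tail) ? -1 : q->tasks[q->head++];
}
```

A task tagged for cluster 0 is pushed onto queue 0, and an idle CPU in cluster 0 consumes from that queue first.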
When a CPU is idle it may scan the task list 400 looking for a task having a cluster ID assigned to, or associated with, the CPU. In order to determine which CPU is assigned to which cluster, a table may be used, or the CPU ID may be divided by the number of CPUs in a cluster, with the quotient being the cluster ID (e.g., if there are 16 CPUs numbered 0-15 and each cluster has 4 CPUs, CPUs 0-3 would be in cluster 0). If the CPU finds a task associated with its cluster, it fetches the task from the list and executes it. If there are no tasks for the idle CPU's cluster, and task stealing is enabled, the CPU may pick any ready task from the task list and execute the task for load balancing purposes. According to one embodiment, preference may be given to tasks in close clusters (e.g., an idle CPU in cluster 0 would give preference to a task in cluster 1 over cluster 2) so that the data produced and consumed is as close as possible.
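The cluster lookup and the stealing preference for close clusters might be sketched as follows. The function names and the flat task list are illustrative assumptions, not the disclosure's actual structures:

```c
#include <assert.h>
#include <stdlib.h>

/* Map a CPU ID to its cluster ID by integer division, as described
 * above: the quotient of the CPU ID over the CPUs per cluster. */
static int cluster_of(int cpu_id, int cpus_per_cluster) {
    return cpu_id / cpus_per_cluster;
}

/* A ready task tagged with the cluster it was assigned to. */
struct task { int cluster_id; int taken; };

/* Scan the task list for a task in `my_cluster`. If none is found and
 * stealing is enabled, take the ready task whose cluster is numerically
 * closest (a simple stand-in for "close clusters"). Returns the index
 * of the chosen task, or -1 if no ready task can be taken. */
static int pick_task(struct task *list, int n, int my_cluster, int steal) {
    int best = -1, best_dist = 0;
    for (int i = 0; i < n; i++) {
        if (list[i].taken) continue;
        int dist = abs(list[i].cluster_id - my_cluster);
        if (dist == 0) { list[i].taken = 1; return i; }
        if (steal && (best < 0 || dist < best_dist)) {
            best = i;
            best_dist = dist;
        }
    }
    if (best >= 0) list[best].taken = 1;
    return best;
}
```

With 16 CPUs and 4 CPUs per cluster, `cluster_of(3, 4)` yields cluster 0 and `cluster_of(15, 4)` yields cluster 3, matching the example above.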
Because the centralized scheduling scheme employs a global list, the load can be distributed evenly across all CPUs. However, all tasks must be added to this global queue, the CPUs must scan this list looking for their next tasks, and a task must be dequeued when a matching one is found. All of these accesses may introduce access contention overhead. Moreover, the global list would need to be stored in memory (e.g., 150), which would require bus bandwidth to access.
Because the distributed queues are local to the cluster (e.g., stored in cache 140 of
The determination of whether to utilize distributed or centralized scheduling may be based on the number of CPUs and clusters in the system. The more CPUs and clusters there are, the more overhead a centralized scheduler may incur and the more beneficial it may be to implement a distributed scheduler. The determination may be made by the programmer. For example, a newcluster annotation may mean centralized scheduling while a newclusterd annotation may mean distributed scheduling.
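One way a compiler might lower such an annotation is into runtime calls that open a new cluster and tag subsequently created tasks with it. The hook names below (`newcluster_begin`, `assign_task_cluster`) are purely hypothetical, shown only to illustrate the idea:

```c
#include <assert.h>

/* Hypothetical runtime state a compiler-emitted annotation could
 * manipulate: a counter of allocated clusters and the cluster that is
 * currently "open" for new tasks. Not the disclosure's actual API. */
static int next_cluster = 0;
static int current_cluster = -1;

/* Emitted at a `newcluster` annotation: the tasks that follow share
 * data, so a fresh cluster is allocated for them. */
static int newcluster_begin(void) {
    current_cluster = next_cluster++;
    return current_cluster;
}

/* Emitted at each task creation: the task inherits the open cluster,
 * so tasks that share data land on the same cluster's CPUs. */
static int assign_task_cluster(void) {
    return current_cluster;
}
```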
Using the locality of the CPUs (e.g., 110) and the shared cache (e.g., 140) increases the efficiency and speed of a multithreaded application and reduces the bus traffic. The efficiency and speed may be further increased if data can be passed from the shared cache to local cache (e.g., 130) prior to the CPU 110 needing the data. That is, the data for the cluster of CPUs is provided to local cache for each of the CPUs within the cluster. A mechanism such as data forwarding may be utilized on “consumer” CPUs to promote sharing of the data (e.g., produced by a producer CPU) from shared cache to higher level local cache. Data prefetching may also be used to prefetch the data from the shared cache to the local cache. In a system with multiple layers of local cache it is possible to prefetch or forward the data to the highest level (e.g., smallest size and fastest access time).
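As a sketch of prefetching from shared cache toward local cache, a consumer loop might issue prefetch hints a fixed distance ahead of its reads. This assumes a GCC-style compiler providing `__builtin_prefetch`; the hint never changes program results, and the distance of 8 elements is an arbitrary illustration:

```c
#include <assert.h>

#define PREFETCH_DISTANCE 8  /* illustrative look-ahead distance */

/* A consumer walking data produced into a shared buffer. The
 * __builtin_prefetch hint (GCC/Clang) asks the hardware to begin
 * pulling an upcoming element toward the consumer's local cache
 * before it is read; arguments 0, 3 mean "read access, keep in all
 * cache levels". Semantics are identical with or without the hint. */
static long consume(const int *shared_buf, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&shared_buf[i + PREFETCH_DISTANCE], 0, 3);
        sum += shared_buf[i];
    }
    return sum;
}
```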
Various embodiments were described above with specific reference made to multithreaded applications with a single producer and multiple consumers. The various embodiments are in no way intended to be limited thereby. For example, a single producer could be associated with a single consumer or multiple producers could be associated with a single consumer or multiple consumers without departing from the scope. Moreover, parallel tasking could be implemented without departing from the scope. For example, the producer CPU could produce data in series and a plurality of consumers could process the results of the producer in parallel. The parallel tasking could be implemented using taskq and task pragmas where a producer produces data in series and the consumers execute tasks on the data in parallel.
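A producer/consumer arrangement of this kind might look as follows in C with OpenMP. Because taskq is an Intel workqueuing extension, this sketch uses the standard OpenMP task construct to the same effect; if OpenMP is not enabled, the pragmas are ignored and the loop simply runs in series with the same result:

```c
#include <assert.h>

#define N 8  /* illustrative problem size */

/* The producer fills `data` in series; consumer tasks square each
 * element in parallel. Each data[i] is written before the task that
 * reads it is created, so there is no race on that element. */
static void produce_and_consume(int *data, int *out) {
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < N; i++) {
            data[i] = i + 1;            /* producer: serial */
            #pragma omp task firstprivate(i) shared(data, out)
            out[i] = data[i] * data[i]; /* consumer: parallel task */
        }
        #pragma omp taskwait           /* join all consumer tasks */
    }
}
```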
Various embodiments were described above with specific reference made to the OpenMP parallel processing language. The various embodiments are in no way intended to be limited thereto but could be applied to any parallel programming language (e.g., CILK, TBB) without departing from the current scope. The compilers and libraries associated with any parallel programming language could be modified to incorporate clustering by locality.
The various embodiments were described with respect to multiprocessor systems with shared memory (e.g., SMP, CMP) but are not limited thereto. The various embodiments can be applied to any system having multiple parallel threads being executed and a shared memory amongst the threads without departing from the scope. For example, the various embodiments may apply to systems that have a plurality of microengines that perform parallel processing of threads.
Although the disclosure has been illustrated by reference to specific embodiments, it will be apparent that the disclosure is not limited thereto as various changes and modifications may be made thereto without departing from the scope. Reference to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described therein is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
An embodiment may be implemented by hardware, software, firmware, microcode, or any combination thereof. When implemented in software, firmware, or microcode, the elements of an embodiment are the program code or code segments to perform the necessary tasks. The code may be the actual code that carries out the operations, or code that emulates or simulates the operations. A code segment may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc. The program or code segments may be stored in a processor readable medium or transmitted by a computer data signal embodied in a carrier wave, or a signal modulated by a carrier, over a transmission medium. The “processor readable or accessible medium” or “machine readable or accessible medium” may include any medium that can store, transmit, or transfer information. Examples of the processor/machine readable/accessible medium include an electronic circuit, a semiconductor memory device, a read only memory (ROM), a flash memory, an erasable ROM (EROM), a floppy diskette, a compact disk (CD-ROM), an optical disk, a hard disk, a fiber optic medium, a radio frequency (RF) link, etc. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments may be downloaded via computer networks such as the Internet, Intranet, etc. The machine accessible medium may be embodied in an article of manufacture. 
The machine accessible medium may include data that, when accessed by a machine, cause the machine to perform the operations described in the following. The term “data” here refers to any type of information that is encoded for machine-readable purposes. Therefore, it may include program, code, data, file, etc.
All or part of an embodiment may be implemented by software. The software may have several modules coupled to one another. A software module is coupled to another module to receive variables, parameters, arguments, pointers, etc. and/or to generate or pass results, updated variables, pointers, etc. A software module may also be a software driver or interface to interact with the operating system running on the platform. A software module may also be a hardware driver to configure, set up, initialize, send and receive data to and from a hardware device.
An embodiment may be described as a process which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.
The various embodiments are intended to be protected broadly within the spirit and scope of the appended claims.
Number | Date | Country | Kind
---|---|---|---
PCT/CN06/03534 | Dec 2006 | CN | national