Embodiments of the present invention relate to data processing and more particularly to performing scheduling of activities across multiple processors using architecture-aware feedback information.
Processing large amounts of data is becoming easier due to improved and more plentiful processing resources. For example, many multiprocessor-type systems exist that can be used to implement complex algorithms to process large amounts of data. However, the complex nature of such architectures and of the algorithms used for such data processing can lead to inefficiencies. For example, due to the nature of an algorithm's workload, one or more processors of a multiprocessor system can have significantly varying loads. Such variances in workloads can lead to deleterious effects, including the presence of idle processors while other processors are overloaded.
Furthermore, tasks partitioned to the various processor resources may not efficiently use data associated therewith. Accordingly, inefficiencies in obtaining data from slower portions of a memory hierarchy can affect performance. Still further, the need for such data can require significant inter-processor communications, which increase data traffic and slow performance of useful work. Still further, some algorithms require various levels of tasks. As a result, synchronizations may be required at each level of the algorithm, forcing all processors to synchronize at these points, which can be time-consuming.
Different manners of attempting to resolve such issues exist. For example, an attempt may be made to load balance multiple tasks across the number of processors available in a given system. Typically, such load balancing is static and is performed at initiation of an algorithm. Accordingly, load imbalances can still occur during runtime.
Common algorithms that are suitable for execution on a multiprocessor system are algorithms that may be parallelized to take advantage of the architecture. One type of application that is suitable for multiprocessor implementation is a pattern mining application; for example, frequent pattern mining may be used to find recurring patterns in data. Different types of pattern mining include itemset mining, sequence mining and graph mining. Generally, a pattern mining algorithm is implemented to find all patterns satisfying user-specified constraints in a given data set. If the pattern exists in at least a minimum number of entries (e.g., corresponding to a minimum threshold), the pattern is considered frequent, and may be grown by one item. The same mining procedure is recursively applied to a subset of the data set, which contains the so-far-mined pattern. However, pattern mining applications as well as many other such applications can suffer from inefficient use of resources in a multiprocessor system.
In a graph mining application, interesting patterns are found in graph-based databases. The most efficient frequent pattern mining implementations use depth-first traversals. In performing these traversals, it is possible to maintain key data elements through recursive calls. The extra state improves serial execution performance at the cost of increased memory usage. For example, while a parent scans a database, it keeps a list of possible ways/places to grow the graph. If this list is kept, then its children can further grow the graph with less effort. The extra state may be maintained in different ways, for example, as embedding lists or global candidate lists. Embedding lists are the mappings of the currently-growing graph to its positions in the database. Thus a child does not need to find the mappings again to grow the graph further. A global candidate list is a list of all the ways to grow the current graph. By keeping this information, the child does not need to walk the graphs to find all ways to grow them; it simply adds to the list when it adds a new node. Maintaining such state may hinder scalability if the algorithm is parallelized for a many-core architecture.
Certain graph mining algorithms create and maintain state across parent and children graphs, while others do not. Algorithms that maintain the state have the lowest execution time for a small number of processors because the parent state is reused in child subgraphs. However, such algorithms do not scale well because of increased memory usage and communication. In contrast, algorithms that do not reuse state do not run as fast for a small number of processors, because state is not maintained and must be recomputed for each task.
Another area suitable for multiprocessor execution is algorithms for computational finance. The rapid increase in computational power coupled with the application of increasingly sophisticated mathematical and statistical methods has given rise to the discipline of computational finance. Portfolio optimization and option pricing are two areas of computational finance.
Although the algorithms of portfolio optimization and option pricing are very different from each other and from frequent pattern mining, the parallelization method is similar. The general principle used in parallelization is as follows: i) partition the total work into independent tasks; ii) additional tasks can be generated by exploring multiple levels of parallelism; and iii) independent tasks (whether of the same level or of different levels) can be performed simultaneously.
To provide effective use of multiprocessor resources, an algorithm may be parallelized for execution across a system, e.g., a multiprocessor system. The parallelization may take into account load balancing, data or cache localities, and minimal communication overhead between various processors of the system. In this way, an efficient scheme for using processing resources in a multiprocessor system may be provided. In some embodiments, a dynamic task partitioning scheme further may be implemented to effect load balancing. More particularly, a processing load may be shared by dynamically partitioning tasks into smaller tasks or subtasks based on the availability of additional processing resources. Furthermore, various embodiments may implement a cache-conscious task scheduling scheme. That is, the scheduling scheme may seek to reuse data of a given task over multiple tasks in related processing resources. While seeking to reuse data as much as possible between tasks, communication overhead between processors may also be minimized by choosing to repeat certain amounts of computation in exchange for reducing communication overhead.
In various embodiments, task partitioning and scheduling may be implemented based on various feedback information available in a system. Such information may include architectural feedback information such as information regarding processor utilization, environmental parameters and the like. The feedback information may further include information from a running application. More specifically, an application may use its knowledge of its own operation and its use of data to provide application (e.g., user-level) information regarding data locality. Based on the various feedback information received, a task scheduler may choose to partition pending tasks into additional tasks and further to control what processing resources are to perform the tasks based on the architectural feedback information and the data locality feedback information, for example.
Certain embodiments discussed herein are with respect to specific algorithms for frequent sequence mining and more particularly to recursive-based algorithms. However, it is to be understood that the scope of the present invention is not so limited, and in other embodiments task partitioning and scheduling as described herein may be effected in other applications.
Referring now to
Based on scheduling of the new task to a processor or multiple processors, the processor(s) may work on the current level (block 120). Next, it may be determined whether additional levels are present in the algorithm (diamond 130). If not, method 100 completes. If instead at diamond 130 it is determined that additional levels are present, control passes to diamond 140. There, it may be determined whether there are sufficient tasks for other resources (diamond 140). For example, in one embodiment the determination may be based on certain architectural feedback information that one or more processors are idle or are about to become idle, or one or more processors are not being fully utilized. While described herein as being on a per-processor basis, it is to be understood that the determination of resource availability may be on different levels of granularity, such as a per-thread basis, per-core basis, per-block basis, per-processor basis or other such granularities as desired in a particular implementation. Furthermore, different embodiments may have different thresholds or other criteria in determining whether sufficient tasks exist for a given system environment. For example, in some implementations a task scheduler may choose to partition tasks when there are fewer pending tasks than available processors, while in other implementations a scheduler may choose to split tasks when there are fewer tasks than total processors present in the system. Other variations, of course, are possible. Furthermore, the criteria may be based on processor utilization as opposed to strictly on an idleness basis. For example, a given processor may report a relatively low utilization rate, indicating it has further processing resource availability. Such a less than fully-utilized processor may receive additional tasks or subtasks, in some embodiments.
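For purposes of illustration, the following C++ sketch shows one possible form of the "sufficient tasks" determination described above. The structure names, the utilization threshold, and the chosen partitioning criterion (fewer pending tasks than available processors) are assumptions of this sketch, not a definitive implementation.

```cpp
// Minimal sketch of a "sufficient tasks" check, assuming a hypothetical
// scheduler that sees a pending-task count and per-processor utilization
// feedback. Names and thresholds are illustrative only.
#include <cstddef>
#include <vector>

struct ProcessorStatus {
    bool   idle;         // reported by architectural feedback
    double utilization;  // 0.0 - 1.0, reported by architectural feedback
};

// A processor counts as "available" if it is idle or under-utilized.
static std::size_t CountAvailable(const std::vector<ProcessorStatus>& procs,
                                  double util_threshold = 0.5) {
    std::size_t n = 0;
    for (const auto& p : procs)
        if (p.idle || p.utilization < util_threshold) ++n;
    return n;
}

// Decide whether pending tasks should be split into additional subtasks.
// Here the criterion is "fewer pending tasks than available processors";
// other embodiments might compare against the total processor count.
bool ShouldPartition(std::size_t pending_tasks,
                     const std::vector<ProcessorStatus>& procs) {
    return pending_tasks < CountAvailable(procs);
}
```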
Still referring to
If however at diamond 140 it is determined that insufficient tasks are available, control passes to block 160. There, one or more tasks may be split into multiple tasks, and one or more of these new tasks may be scheduled on different processors. Accordingly, control passes back to block 110, discussed above. While described in
Using a method in accordance with an embodiment of the present invention, optimal load balancing may be realized. In a frequent pattern mining algorithm, for example, tasks may be partitioned dynamically at each level of recursion. Furthermore, runtime architecture feedback such as processor utilization, cache usage, network congestion, and the like may be used for load balancing. In the embodiment of
As discussed above, embodiments may be used in connection with frequent sequence mining or frequent pattern mining. Referring now to
Still with reference to
In some implementations, if all the processors are busy, a scheduler may choose to partition tasks based on a minimum threshold of available tasks, which can be a system parameter. While described with this particular partitioning of tasks and their assignment to particular processors in the embodiment of
Referring now to Table 1 below, shown is a parallel frequent sequence mining algorithm (i.e., PrefixSpan) to effect architecture-aware dynamic task partitioning. The general outline of this algorithm is as follows. Given a pattern that is initially known, and a data set (S), all occurrences of patterns in the data set may be identified. Further, infrequent items may be removed from the data set. Then, if a pattern exists in at least a minimum threshold of entries, the pattern is grown by one item and the search for new patterns in the data set is performed recursively. Note that in recursively performing the algorithm, a current task may either continue on a current processor, or one or more new tasks may be created for other processors based on architecture feedback information. More specifically, as shown in Table 1 when calling the recursion for p′ and S′, a decision is made whether the task will keep executing the recursion on the same processor, or whether to create a new task to be executed on other processors. As discussed above, various architectural feedback can be used for this decision.
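Because Table 1 is not reproduced here, the following C++ sketch illustrates only the general outline described above. The projection and reporting helpers, and the placeholders standing in for the architectural feedback mechanism and the task queue, are assumptions of this sketch and are not taken from the original listing.

```cpp
// Hedged sketch of the recursion outlined above for Table 1. Helper routines
// that stand in for the architecture-feedback mechanism and the task queue
// are placeholders assumed for illustration.
#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <vector>

using Item     = int;
using Sequence = std::vector<Item>;
using Pattern  = std::vector<Item>;
using Database = std::vector<Sequence>;

// Placeholder for architectural feedback (e.g., an idle or under-utilized core).
static bool underutilized_processor_available() { return false; }

// Record a frequent pattern (here, simply print it).
static void report(const Pattern& p) {
    for (Item i : p) std::cout << i << ' ';
    std::cout << '\n';
}

// Project the data set onto item i: keep the suffix after its first occurrence.
static Database project(const Database& s, Item i) {
    Database out;
    for (const Sequence& seq : s)
        for (std::size_t k = 0; k < seq.size(); ++k)
            if (seq[k] == i) { out.emplace_back(seq.begin() + k + 1, seq.end()); break; }
    return out;
}

void MineSequences(const Pattern& p, const Database& s, std::size_t min_support);

// Placeholder for creating a new task on another processor; here it simply
// runs the recursion inline.
static void spawn_on_other_processor(const Pattern& p, const Database& s,
                                     std::size_t min_support) {
    MineSequences(p, s, min_support);
}

void MineSequences(const Pattern& p, const Database& s, std::size_t min_support) {
    // Count, for each item, the number of sequences in which it appears.
    std::map<Item, std::size_t> support;
    for (const Sequence& seq : s) {
        std::set<Item> seen(seq.begin(), seq.end());
        for (Item i : seen) ++support[i];
    }
    for (const auto& [item, count] : support) {
        if (count < min_support) continue;   // prune infrequent items
        Pattern grown = p;
        grown.push_back(item);               // grow the pattern by one item
        report(grown);
        Database projected = project(s, item);
        // Architecture-aware decision: continue on this processor, or create
        // a new task for another processor based on feedback information.
        if (underutilized_processor_available())
            spawn_on_other_processor(grown, projected, min_support);
        else
            MineSequences(grown, projected, min_support);
    }
}
```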
In addition to task partitioning in accordance with an embodiment of the present invention, task scheduling may also be performed to improve cache usage. That is, pending tasks waiting to be executed may be stored in a task queue. Typically, tasks may be selected from the queue and performed in a predetermined order (e.g., first-in first-out (FIFO), last-in first-out (LIFO) or even random task scheduling). However, none of these methods takes into consideration the location of the data to be used in a given task. By using information obtained, e.g., from the application itself (i.e., a user-level application), dynamic task scheduling may be implemented in a cache-conscious manner that improves data usage, improves efficiency and reduces communication overhead. Different manners of implementing cache-conscious task scheduling may be performed. In some embodiments data locality information obtained from the application itself may be used in determining an appropriate processor to perform a particular task.
Referring now to
Initially in the first level, as shown in
Note that cache-conscious scheduling may take into consideration different levels of a memory hierarchy in certain embodiments. In this way, although a private cache of a given processor may not include data needed for a next task, a shared cache between, e.g., multiple cores of a chip multiprocessor (CMP) may contain the data, reducing the need for additional memory latencies and communication with a remainder of a memory hierarchy to obtain such data. Note that with respect to processors P1 and P3, when they complete their first level tasks (i.e., for items ‘b’ and ‘d’), assume that only tasks from items ‘a’ and ‘c’ are available for execution. Accordingly, the private caches of processors P1 and P3 do not contain data for item ‘a’ or item ‘c’.
However if P0 and P1 are two cores of a CMP system that share a middle-level cache (as are P2 and P3), task ‘aab’ may be assigned to P1, and task ‘ccd’ may be assigned to P3. In this way, reduced communication and latencies may be effected. That is, to obtain the needed data for performing the third-level tasks of item ‘a’ and item ‘c’ on either of processors P1 or P3, snoop traffic and/or memory requests need only extend to the middle-level cache to obtain the needed data. In contrast, if tasks were randomly assigned, the needed data may not be available without further communications and latencies to obtain the data from further distant portions of a memory hierarchy. Thus, a cache-conscious task scheduling algorithm in accordance with an embodiment of the present invention may improve performance by understanding the data sharing pattern of an algorithm and the cache architecture of a system, and applying the knowledge to task scheduling. Furthermore, feedback information from the application itself may be used to determine the data that is needed and further where that data has been recently used.
While different manners of implementing cache-conscious task scheduling may be performed, in some embodiments a data distance measure may be used. For example, each task that is to be performed may have a data usage identifier (ID) associated therewith. In one embodiment, a data usage ID for a given task may be a 3-tuple (i.e., address of the beginning of the accessed region, address of the end of the accessed region, and access pattern stride). A small Euclidean distance between two such data usage IDs then implies access to the same region. Of course, the distance calculation function may account for the possibility that the region accessed by a first task is a sub-region of the region accessed by a second task. Each entry in a task queue may include the data usage ID along with a task identifier. Furthermore, a scheduler in accordance with an embodiment of the present invention may maintain a history of recent data usage IDs handled by each processor of the system. Accordingly, when a given processor has available resources, the scheduler may determine a data sharing distance measure based on a distance between the data usage IDs associated with the processor and the data usage IDs stored in the task queue. In this way, the task associated with the shortest distance may be assigned to that processor.
For example, if a processor becomes idle, a combination of data sharing distances between the data usage IDs of the processor's history and those of all available tasks may be calculated. The task with the shortest distance may be the task chosen for scheduling on the processor. Of course other manners of using data locality information may be effected in other embodiments.
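As an illustration of the data usage ID and distance measure described above, the following C++ sketch encodes the 3-tuple and selects the queued task with the shortest distance to a processor's recent history. The history representation and the use of the minimum distance over that history are assumptions of this sketch.

```cpp
// Minimal sketch of data-usage-ID based, cache-conscious task selection.
// The 3-tuple (region begin, region end, stride) and the Euclidean distance
// follow the text; the history handling here is an assumption.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <limits>
#include <vector>

struct DataUsageId {
    std::uintptr_t begin;   // address of the beginning of the accessed region
    std::uintptr_t end;     // address of the end of the accessed region
    std::ptrdiff_t stride;  // access pattern stride
};

struct PendingTask {
    int         task_id;
    DataUsageId usage;      // stored alongside the task in the task queue
};

static double Distance(const DataUsageId& a, const DataUsageId& b) {
    const double db = static_cast<double>(a.begin)  - static_cast<double>(b.begin);
    const double de = static_cast<double>(a.end)    - static_cast<double>(b.end);
    const double ds = static_cast<double>(a.stride) - static_cast<double>(b.stride);
    return std::sqrt(db * db + de * de + ds * ds);
}

// Pick the queued task whose data usage is "closest" to what the processor has
// touched recently; returns -1 if the queue is empty.
int PickTaskFor(const std::deque<DataUsageId>& processor_history,
                const std::vector<PendingTask>& queue) {
    int best = -1;
    double best_dist = std::numeric_limits<double>::max();
    for (const PendingTask& t : queue) {
        double d = std::numeric_limits<double>::max();
        for (const DataUsageId& h : processor_history)
            d = std::min(d, Distance(t.usage, h));   // shortest distance to the history
        if (d < best_dist) { best_dist = d; best = t.task_id; }
    }
    return best;
}
```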
Referring now to
Still referring to
Next, available processor resources may be determined (block 330). For example, a resource allocator may determine the availability of one or more processors or resources thereof based on the architectural feedback information. If one or more processors are indicated to soon become available, such processors may be listed in a resource utilization queue. Furthermore, depending on utilization rates of one or more processors, an underutilized processor or resources thereof may also be listed in the resource utilization queue.
Next, it may be determined whether sufficient next level tasks are available for the available processing resources (diamond 340). For example, a task partitioner may analyze the pending tasks, e.g., pending tasks in a task queue, and compare them to the number of available resources. If sufficient tasks are available, control may pass to block 350. Note that the determination of sufficient tasks may be based on different criteria in different embodiments; for example, threshold-based criteria may be used in some embodiments. At block 350, a resource scheduler may dynamically allocate next level tasks to the available processing resources based on the feedback information. For example, the resource scheduler may select a task to execute from the task queue on a given processor or resource thereof based on the feedback information, including data locality information. In this way, data that is present in a cache of a processor can be efficiently used or reused during execution of the next task without the need to obtain the data from further portions of the memory hierarchy, reducing latency and communication overhead.
Still referring to
Referring now to
As further shown in
In addition, scheduler 450 may further be configured to determine whether to maintain state information of a given processor when executing a new task thereon. For example, as will be described further below, when a processor is to begin executing a subtask of a previously performed task on the processor, scheduler 450 may choose to maintain the state to improve performance and cache localities. In contrast, if a subtask of a previously executed task is to be performed on a different processor, the state is not communicated to the new processor, thus reducing memory footprint and communication overhead.
Still referring to
This available resource information may be provided to task partitioner 460 and data locality analyzer 470. Task partitioner 460 may further receive feedback information from user-level application 420, e.g., data locality information, along with an indication of pending tasks in task queue 465. Based on the available resources as indicated by resource allocator 455 and the pending tasks in task queue 465, task partitioner 460 may choose to dynamically partition one or more of the tasks, e.g., of a next level of an application into multiple tasks. Accordingly, task partitioner 460 may provide the partitioned tasks to task queue 465. Task queue 465 may thus include entries for each pending task. In some implementations, in addition to the identification of pending tasks, each entry in task queue 465 may further include data locality information to indicate the location of data needed by the task.
In some embodiments, data locality analyzer 470 may receive pending entries from task queue 465, along with the identification of available resources, e.g., from resource allocator 455. Based on this information, data locality analyzer 470 may determine distances between data needed by the various tasks and the available resources. These data distances may be provided to resource scheduler 475. In various embodiments, resource scheduler 475 may select a task for execution on a particular processor (or resource thereof) based on the smallest data distance for a given processor or resource for the various tasks pending in task queue 465. Accordingly, resource scheduler 475 may provide control information to the selected processor or resource of multiple processor resources 410 and user-level application 420 to cause the selected pending task to be executed on the selected available resource. While described with this particular implementation in the embodiment of
By dynamically determining when to maintain state of a mining operation, temporal locality of a cache may be increased, without degradation in load balancing, memory usage, and task dependency. Accordingly, in certain embodiments the state of a parent task may be dynamically maintained to speed up a given algorithm (such as graph mining) when its descendant tasks are processed on the same processor. A parent task may generate a significant amount of state that may be shared in descendant tasks. For example, such state may take the form of an embedded list or a global candidate list.
When these descendant tasks run on the same processor, much of the state may still be in the processor's local cache, thus improving cache locality and reducing execution time. However, maintaining state may be prohibitive if the memory footprint is large or may require too much communication if descendant tasks are assigned to different processors. This scheme thus may automatically eliminate state to maintain optimal footprint size or reduce communication when descendant tasks are assigned to different processors. Essentially, it is a dynamic hybrid scheme that either maintains state or does not reuse state, depending on where a task is to be performed. Table 2 is pseudocode of a depth-first algorithm that uses an embodiment of the present invention to mine frequent graphs.
As shown in Table 2, this algorithm traverses the search space in depth-first order, dynamically maintaining state (i.e., embeddinglist el) for descendant graphs. Portions of the subtask that will run on the same processor make use of the state of the parent (line 11). If instead the task is partitioned and assigned to another processor (line 10), the parent state is stripped out to minimize communication and/or memory footprint. This maintains a high level of temporal locality in the cache (when state is maintained), while still allowing the work to be partitioned at each recursion level (when state is removed). Thus, the algorithm balances concurrency with cache reuse. In addition, dynamic state maintenance facilitates mining much larger datasets because the system can adapt to the footprint of the workload. In that case, parent state is not preserved for descendant tasks. These decisions may be made at runtime based on feedback from the hardware or runtime system. The runtime/hardware information such as memory utilization, processor affinity, neighbor processors, and communication latency may be used to determine whether state should be maintained across parent and descendant tasks. The end result is decreased execution time due to increased cache locality.
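Because Table 2 is not reproduced here, the following C++ sketch illustrates only the keep-state-versus-strip-state decision described above. The graph, embedding-list, and extension types, as well as the placeholder feedback and task-queue routines, are assumptions of this sketch and not the original pseudocode.

```cpp
// Hedged sketch of dynamic state maintenance in depth-first graph mining.
// All types and helper bodies below are placeholders for illustration; the
// point shown is only where parent state is reused versus stripped out when a
// subtask is handed to another processor.
#include <optional>
#include <vector>

struct Graph {};
struct EmbeddingList {};              // mappings of the growing graph into the database
struct Extension { Graph grown; };    // one way to grow the current graph

// Placeholder bodies; a real miner would enumerate frequent extensions, update
// the embedding list, consult hardware/runtime feedback, and enqueue tasks.
static std::vector<Extension> FrequentExtensions(const Graph&,
                                                 const std::optional<EmbeddingList>&) {
    return {};
}
static EmbeddingList UpdateEmbeddings(const EmbeddingList& parent, const Extension&) {
    return parent;
}
static bool PartitionToOtherProcessor() { return false; }   // runtime/hardware feedback
static void EnqueueTask(const Graph&) {}                    // task carries no parent state

void MineGraphs(const Graph& g, const std::optional<EmbeddingList>& el) {
    for (const Extension& e : FrequentExtensions(g, el)) {
        if (PartitionToOtherProcessor()) {
            // Subtask goes to another processor: strip the parent state to keep
            // communication and memory footprint small.
            EnqueueTask(e.grown);
        } else {
            // Subtask stays on this processor: reuse and extend the parent's
            // embedding list so the child need not recompute its mappings.
            std::optional<EmbeddingList> child_el;
            if (el) child_el = UpdateEmbeddings(*el, e);
            MineGraphs(e.grown, child_el);
        }
    }
}
```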
Referring now to
Control passes, either from diamond 615 or block 618, to block 620. There, the current task may be processed using the state information either present in the processor or created in the processor (block 620). Upon completion of the current task, control passes to diamond 640. There, it may be determined whether the next task is to be performed on the same processor, e.g., the same processor core (diamond 640). If so, the current state may be kept for reuse in connection with the new task (block 650). Accordingly, control passes back to diamond 615, discussed above.
If instead the next task is not to be performed on the same processor (e.g., core), control passes to block 660. There, the state is not maintained (block 660). Accordingly, control passes back to block 610 for obtaining a new task to be performed on the given processor core. While described with this particular implementation in the embodiment of
Referring now to
Thus as shown in
Thus embodiments of an architecture-aware dynamic task partitioning scheme may be more efficient than static task partitioning. Task partitioning techniques in accordance with one embodiment may always maintain an optimal number of tasks regardless of system dynamics. In this way, an appropriate level of parallelism can be exploited without incurring significant parallelization overhead. Further, embodiments may adapt to different systems/environments better than static partitioning. That is, because on-the-fly feedback information is obtained from hardware, task scheduling may adapt to different systems or dynamic environments readily without human intervention. For example, if new processors are added, partitioning may automatically rebalance the parallelism to take advantage of the additional resources.
Cache-conscious task scheduling in accordance with an embodiment may use cache memory more efficiently because it leverages the knowledge of algorithmic data reuse and cache hardware architecture. For the same data set, a smaller working set may be maintained than a conventional algorithm. Thus a smaller cache may be used to achieve a desired level of performance, or a larger cache can handle a larger data set. Thus embodiments of the present invention may simultaneously implement good load balancing, good cache localities, and minimal communication overhead.
As mentioned above, embodiments are also suitable for computational finance algorithms. Portfolio optimization is the problem of finding the distribution of assets that maximizes the return on the portfolio while keeping risk at a reasonable level. The problem can be formulated as a linear or non-linear optimization problem. Such a problem is commonly solved using an interior-point method (IPM). The crux of an IPM is a direct linear solver. Sparse direct solver computation is defined by an elimination tree (ET)—a task dependence graph which captures the order of updates performed on the rows of a sparse matrix.
Using an embodiment of the present invention, a more efficient task partitioning is achieved. The process is as follows: during the early stage of the computation, tasks are partitioned based on coarse grain parallelism. When all coarse level tasks are dispatched, the application receives hardware feedback information. If there are still resources available, the application explores additional levels of parallelism to generate additional tasks to keep all resources busy. This process repeats and enables the application to explore deeper and deeper levels of parallelism as the computation moves towards the top of the ET (when not too many independent tasks remain).
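The following C++ sketch illustrates, under stated assumptions, the coarse-to-fine task generation described above: coarse-level tasks are dispatched first, and additional levels of parallelism are explored when hardware feedback reports idle resources. The task type, splitting policy, thresholds, and depth bound are placeholders and are not taken from any particular solver.

```cpp
// Hedged sketch of feedback-driven exploration of deeper parallelism levels
// for an elimination-tree (ET) guided solver. Only the control flow follows
// the text; everything else is illustrative.
#include <cstddef>
#include <deque>
#include <vector>

constexpr int kMaxLevel = 8;   // illustrative bound on how deep parallelism is explored

// An ET task; Split() exposes one more level of parallelism.
struct EtTask {
    int level = 0;
    std::vector<EtTask> Split() const {
        return { EtTask{level + 1}, EtTask{level + 1} };   // placeholder refinement
    }
};

static bool ResourcesIdle() { return false; }   // placeholder for hardware feedback
static void Dispatch(const EtTask&) {}          // placeholder: run task on a free processor

void RunSolver(std::deque<EtTask> coarse_tasks) {
    std::deque<EtTask> pending = std::move(coarse_tasks);
    while (!pending.empty()) {
        // If hardware feedback reports idle resources while few tasks remain,
        // explore an additional level of parallelism by splitting a task.
        if (ResourcesIdle() && pending.size() < 4 &&        // threshold is illustrative
            pending.front().level < kMaxLevel) {
            EtTask t = pending.front();
            pending.pop_front();
            for (const EtTask& sub : t.Split()) pending.push_back(sub);
            continue;
        }
        Dispatch(pending.front());
        pending.pop_front();
    }
}
```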
When selecting a task to schedule on a processor, choosing a task that has more shared data with a “just completed” task reduces unnecessary cache misses and improves system performance. Such scheduling may be implemented via analysis of data usage IDs, in some embodiments. In the portfolio optimization scenario, this means scheduling tasks from the same ET branch on the same processor. Since these tasks share much data, it is generally best to schedule them on the same processor. Sometimes, however, when not enough tasks are available, these tasks can be scheduled on different processors. Under this circumstance, the scheduling algorithm may weigh the benefit of parallel execution against the cost of data contention and data migration.
Options are traded financial derivatives commonly used to hedge the risk associated with investing in other securities, and to take advantage of pricing anomalies in the market via arbitrage. There are many types of options. The most popular options are European options, which can only be exercised at expiry, and American options, which can be exercised at any time up to expiry. There are also Asian options, whose payoff depends on the average asset price over time, and multi-asset options, whose payoff depends on multiple assets. The key requirement for utilizing options is calculating their fair value.
A binomial tree (BT) is one of the most popular methods for option pricing today. It can be used to price options on a single asset or on multiple (two or three) assets. A serial implementation of a BT algorithm including two nested loops is shown in Table 3.
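Table 3 is not reproduced here; as a stand-in, the following is a conventional serial binomial tree pricer for a European call that exhibits the same two-nested-loop structure (an outer loop over time steps and an inner loop over the nodes of each step). Parameter names and values are illustrative.

```cpp
// Standard serial binomial tree (CRR) pricer for a European call, shown only
// to illustrate the two nested loops referred to in the text.
#include <algorithm>
#include <cmath>
#include <vector>

double PriceEuropeanCall(double spot, double strike, double rate,
                         double sigma, double maturity, int steps) {
    const double dt   = maturity / steps;
    const double up   = std::exp(sigma * std::sqrt(dt));              // up factor
    const double down = 1.0 / up;
    const double disc = std::exp(-rate * dt);
    const double p    = (std::exp(rate * dt) - down) / (up - down);   // risk-neutral prob.

    // Option values at expiry (leaf nodes of the tree).
    std::vector<double> value(steps + 1);
    for (int j = 0; j <= steps; ++j) {
        const double price = spot * std::pow(up, j) * std::pow(down, steps - j);
        value[j] = std::max(price - strike, 0.0);
    }

    // Outer loop: walk back one time step at a time.
    for (int i = steps - 1; i >= 0; --i)
        // Inner loop: every node at this step depends on its two children.
        for (int j = 0; j <= i; ++j)
            value[j] = disc * (p * value[j + 1] + (1.0 - p) * value[j]);

    return value[0];
}
```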
One partitioning approach would be to divide the problem in the time domain (i.e., the outer loop) and synchronize at the end of each time step. However, this simple partitioning suffers from excessive synchronization overhead. Tile partitioning, as shown in
Similar to parallel task partitioning for frequent pattern mining and the IPM direct solver, the binomial tree computational flow is affected by two fundamental partitioning issues. The first partitioning issue has to do with the balance between the amount of parallelism to explore and data locality. In tile partitioning, the amount of parallelism available depends on the tile size. As the tile size is reduced, the number of independent tiles increases; thus, the available parallelism increases. However, as the tile size is reduced, the amount of computation and data reuse per tile also decreases, while the task management overhead, such as boundary updates and task scheduling by the task scheduler, increases. The second partitioning issue is related to the tree topology. As the computation advances (i.e., moving up the tree), fewer and fewer nodes are available. Eventually, there is only one node left. Since the amount of parallelism available is proportional to the number of independent nodes, parallelism decreases as the computation advances.
A dynamic partitioning and locality-aware scheduling scheme may be applied to the binomial tree algorithm in the following way. At the beginning of the problem, a tile size is selected to provide adequate parallelism and good locality. As the computation advances, the amount of available parallelism decreases. Occasionally, the application checks the status of available resources through architectural feedback mechanisms. Should the application find that resources are idle for an extended period of time, it can adjust the tile size to increase the number of parallel tasks. The frequency with which the application checks resource status affects the quality of load balancing. Frequent checking may result in a slightly more balanced execution, but will also lead to higher overhead. Thus a dedicated runtime balance-checking mechanism may be used to obtain optimal performance.
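The following C++ sketch illustrates one possible form of the tile-size adjustment described above. The check interval, the halving policy, and the idle-resource predicate are assumptions of this sketch.

```cpp
// Minimal sketch of periodic, feedback-driven tile-size adjustment for the
// tiled binomial tree computation. All policies below are illustrative.
#include <cstddef>

// Placeholder for architectural feedback: has some processor been idle for an
// extended period since the last check?
static bool ResourcesIdleTooLong() { return false; }

// Periodically reconsider the tile size; shrinking the tile produces more
// independent tiles (more parallel tasks) at the cost of less data reuse and
// more task-management overhead.
std::size_t AdjustTileSize(std::size_t current_tile, std::size_t min_tile,
                           std::size_t step_index, std::size_t check_interval) {
    // Check resource status only every `check_interval` time steps; checking
    // more often gives slightly better balance but higher overhead.
    if (check_interval == 0 || step_index % check_interval != 0) return current_tile;
    if (ResourcesIdleTooLong() && current_tile / 2 >= min_tile)
        return current_tile / 2;   // halving policy is illustrative
    return current_tile;
}
```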
When scheduling tasks on available resources, the same principle of favoring a task with more shared data from the “just finished” task still applies. Due to overhead incurred in moving data from one cache to another (due to contention or migration), random or systematic scheduling often leads to sub-optimal performance. Thus in scheduling tasks, the amount of data sharing by two tasks may be measured by a data sharing distance function, as described above.
Embodiments may be implemented in many different system types. Referring now to
First processor 770 and second processor 780 may be coupled to a chipset 790 via P-P interconnects 752 and 754, respectively. As shown in
In turn, chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, first bus 716 may be a Peripheral Component Interconnect (PCI) bus, as defined by the PCI Local Bus Specification, Production Version, Revision 2.1, dated Jun. 1995 or a bus such as the PCI Express bus or another third generation input/output (I/O) interconnect bus, although the scope of the present invention is not so limited.
As shown in
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.