An embodiment relates to partitioning a set of tasks on an electronic control unit.
A multi-core processor is integrated within a single chip and is typically referred to as a single computing unit having two or more independent processing units, commonly referred to as cores. The cores read and execute programmed instructions, such as adding data and moving data. An efficiency of the multi-core processor is that the cores can run multiple instructions at the same time in parallel.
Memory layouts affect the memory bandwidth of a cache-enabled architecture for an electronic control unit (ECU). For example, if a multi-core processor is inefficiently designed, bottlenecks in retrieving data may occur when tasks among multiple cores are not properly balanced, which also increases communication costs.
An advantage of an embodiment is optimizing access of data in a global memory so that data stored in a respective location and accessed by a respective task is processed by a same respective core. In addition, the workload is balanced among the respective number of cores of the multi-core processor so that each of the respective cores performs a similar amount of workload processing. The embodiments described herein generate a plurality of permutations based on re-ordering techniques for pairing respective tasks with respective memory locations based on memory-location accesses. Permutations are divided and subdivided based on the number of cores desired until a respective permutation is identified that generates a balanced workload among the cores while minimizing communication costs.
An embodiment contemplates a method of partitioning tasks on a multi-core electronic control unit (ECU). A signal list is extracted from a link map file in a memory. The link map file includes a text file that details where data is accessed within a global memory device. Memory access traces relating to executed tasks from the signal list are obtained. A number of times each task accessed a memory location and the respective task workload on the ECU are identified. A correlation graph is generated between each task and each accessed memory location. The correlation graph identifies a degree of linking relationship between each task and each memory location. The correlation graph is reordered so that the respective tasks and associated memory locations having greater degrees of linking relationships are adjacent to one another. The multi-core processor is partitioned into a respective number of cores, wherein allocating tasks and memory locations among the respective number of cores is performed as a function of substantially balancing workloads among the respective cores.
A link map file 14 is a text file that details where data and code are stored within the executables in the global memory device 12. The link map file 14 includes trace files that contain an event log describing what transactions have occurred within the global memory device 12 as to where code and data are stored. As a result, the link map file 14 may be obtained identifying all the tasks and the associated memory addresses that were accessed when the application code was executed by the ECU 10.
A mining processor 16 is used to perform data mining 18 from the global memory device 12, reordering tasks and associated memory locations 20, identifying workloads of a permutation 22, and partitioning tasks and associated memory locations 24 for designing the multi-core processor.
In regards to data mining, for each task (e.g., A, B, C, D) a memory access hit count table is constructed as illustrated in
After the matrix X is generated, the mining processor generates permutations that are used to identify the respective permutation that will provide the most efficient partitioning to evenly distribute the workload of the ECU.
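As a minimal sketch of the data-mining step, the hit-count matrix X can be constructed from mined access traces roughly as follows; the task names, memory labels, and trace values here are hypothetical illustrations, not values from the embodiment:

```python
from collections import Counter

# Hypothetical memory access trace mined from the link map file:
# each entry is a (task, memory location) access event.
trace = [
    ("A", "m1"), ("A", "m1"), ("A", "m2"),
    ("B", "m2"), ("B", "m3"),
    ("C", "m3"), ("C", "m3"), ("C", "m4"),
    ("D", "m4"),
]

tasks = sorted({t for t, _ in trace})   # ["A", "B", "C", "D"]
mems = sorted({m for _, m in trace})    # ["m1", "m2", "m3", "m4"]
hits = Counter(trace)

# X[i][j] = number of times task i accessed memory location j.
X = [[hits[(t, m)] for m in mems] for t in tasks]
```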
Permutations are various orderings of the tasks and memory locations. As shown in
The reordering of the vertices of the bipartite graph is performed using a weighted adjacency matrix
constructed using the matrix X in
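One plausible construction of this matrix (an assumption for illustration, since the referenced figure is not reproduced here) places the task nodes before the memory nodes and uses X as the cross-block, giving a symmetric weighted adjacency matrix W:

```python
import numpy as np

# Hypothetical 4-task x 4-memory hit-count matrix X
# (rows: tasks A-D, columns: memory locations m1-m4).
X = np.array([[2, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 2, 1],
              [0, 0, 0, 1]], dtype=float)

n_t, n_m = X.shape
# Bipartite weighted adjacency matrix: task nodes first, memory nodes
# second; edge weights are the access hit counts.
W = np.block([[np.zeros((n_t, n_t)), X],
              [X.T, np.zeros((n_m, n_m))]])
```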
min_π J(π) = Σ_{l=1}^{N−1} l^2 Σ_{i=1}^{N−l} w_{π(i), π(i+l)}
This is equivalent to finding the inverse permutation π^−1 such that the following energy function is minimized:

J(π^−1) = Σ_{i=1}^{N} Σ_{j=1}^{N} w_{ij} (π^−1(i) − π^−1(j))^2
Solving the above problem is approximated by computing the eigenvector (q2) with the second smallest eigenvalue for the following eigen equation:
(D−W)q=λDq
where the Laplacian matrix L = D − W, and the degree matrix D is diagonal and defined as D_{ii} = Σ_j w_{ij}.
The thus-obtained q2 is sorted in ascending order. The indices of the vertices after sorting give the desired permutation {π_1, . . . , π_N}. The order of task nodes and memory nodes is then derived by rearranging the task nodes and memory nodes in the bipartite graph according to this permutation.
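The steps above (solve (D − W)q = λDq, take q2, sort in ascending order) can be sketched as follows; routing the generalized eigenproblem through the symmetric normalized Laplacian and a dense eigensolver is an implementation choice for this sketch, not a requirement of the embodiment:

```python
import numpy as np

def spectral_order(W):
    """Return the permutation that sorts the generalized eigenvector q2
    (second-smallest eigenvalue of (D - W) q = lambda D q) in ascending
    order, placing strongly linked vertices adjacent to one another."""
    d = W.sum(axis=1)                 # vertex degrees; assumed nonzero
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: D^{-1/2} (D - W) D^{-1/2},
    # formed by scaling entry (i, j) with d_i^{-1/2} * d_j^{-1/2}.
    L_sym = (np.diag(d) - W) * np.outer(d_inv_sqrt, d_inv_sqrt)
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues ascending
    # Recover the generalized eigenvector q = D^{-1/2} v for the
    # second-smallest eigenvalue.
    q2 = d_inv_sqrt * eigvecs[:, 1]
    return np.argsort(q2)
```

For a simple path graph the resulting order recovers the path itself (up to reversal), which is the locality-preserving behavior the reordering relies on.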
As illustrated in
To assure that the workload of the cores is evenly distributed, the two pairs of task nodes and associated memory nodes having the highest workloads among the plurality of task nodes are split and positioned at opposite ends of the bipartite graph. This assures that the two respective task nodes having the highest workloads among the plurality of tasks will not be within a same core, which would otherwise overload the workload for a single core. After these two pairs of tasks are reordered, a next pair of task nodes and associated memory nodes having a next highest workload among the remaining task nodes and memory nodes is split and positioned next to the previously split task nodes and memory nodes. This procedure continues with a next respective pair of task nodes and associated memory nodes having a next highest workload among the available task nodes and associated memory nodes until all available task nodes and associated memory nodes are allocated within the bipartite graph. This results in an even distribution of workloads such that the bipartite graph may be divided equally in the middle as shown, and the workload distribution between the respective cores is substantially similar. As shown in the bipartite graph in
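A minimal sketch of this split-and-place heuristic follows; the pair names and workload numbers are hypothetical, and representing each task node and its associated memory node as a single (pair, workload) tuple is an assumption for illustration:

```python
def balanced_layout(pairs):
    """Given (task/memory pair, workload) tuples, place the heaviest
    pairs at opposite ends of the layout and work inward, so that
    splitting the layout at the middle yields two halves with similar
    total workload."""
    ordered = sorted(pairs, key=lambda p: p[1], reverse=True)
    left, right = [], []
    for i, pair in enumerate(ordered):
        # Alternate placement: even ranks grow from the left end,
        # odd ranks grow from the right end.
        (left if i % 2 == 0 else right).append(pair)
    return left + right[::-1]
```

For example, `balanced_layout([("A", 10), ("B", 9), ("C", 5), ("D", 4)])` yields the layout A, C, D, B: splitting in the middle gives halves with workloads 15 and 13, and the two heaviest pairs (A and B) sit at opposite ends.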
Moreover, once the two cores have been partitioned, if additional partitioning of cores is required (e.g., four cores), then the partitioned cores may be subdivided again, without reordering, based on workload balancing and minimizing communication costs. Alternatively, the reordering technique may be applied, if desired, to an already partitioned core to reorder the respective tasks and memories therein and then subdivide the cores further.
Various permutations of partitioning may be applied to find the most efficient partition that produces the most balanced workload between the cores of the processor and also minimizes communication costs.
In step 31, a signal list is extracted from a link map file in a global memory. The signal list identifies traces of memory locations hit by the tasks executed by the application codes.
In step 32, the memory access traces are collected by a mining processor.
In step 33, a matrix is constructed that includes the task memory access count (i.e., hits) for each memory location. It should be understood that a respective task may not access a respective memory location at all, and under such circumstances the entry will be shown as a "0" or left blank, indicating that the task did not access the respective location.
In step 34, various permutations are generated that include correlation graphs (e.g., bipartite graphs) that show the linking relationships between the task nodes executed by the application code and the respective memory nodes accessed by the task nodes. Each of the permutations utilizes optimum ordering algorithms for determining the respective order of the task nodes and associated memory nodes. Task nodes are correlated with those memory nodes having hits between one another and are disposed adjacent to one another. The task nodes and associated memory nodes are optimally positioned in the correlation graph so that, when partitioned, workload usages within the cores of the processor are substantially balanced.
In step 35, the correlation graph is partitioned for identifying which tasks are associated with which core when the tasks are executed on the ECU. The partition will select a split with respect to the respective task nodes and associated memory nodes based on the balanced workload and minimized communication costs. Additional partitioning is performed based on the required number of cores in the ECU.
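One way to score candidate splits of the reordered vertices is sketched below; the specific cost function, the use of total crossing edge weight as the communication cost, the use of degree sums as a workload proxy, and the `alpha` weighting are all assumptions for illustration rather than the embodiment's stated formula:

```python
import numpy as np

def split_cost(W, cut, alpha=1.0):
    """Cost of splitting ordered vertices into [0, cut) and [cut, N):
    communication cost is the total edge weight crossing the cut, and
    imbalance is the difference in total vertex degree (a workload
    proxy) between the two halves."""
    comm = W[:cut, cut:].sum()
    deg = W.sum(axis=1)
    imbalance = abs(deg[:cut].sum() - deg[cut:].sum())
    return comm + alpha * imbalance

def best_split(W):
    """Cut index minimizing the combined cost over non-trivial splits."""
    return min(range(1, W.shape[0]), key=lambda c: split_cost(W, c))
```

On a reordered adjacency matrix where two clusters are heavily connected internally and weakly connected to each other, the minimum-cost cut falls between the clusters, matching the balanced, low-communication split the step describes.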
In step 36, the selected permutation is used to design and produce the task partitioning of the multi-core ECU.
While certain embodiments of the present invention have been described in detail, those familiar with the art to which this invention relates will recognize various alternative designs and embodiments for practicing the invention as defined by the following claims.