This invention relates to a technique for speeding up the execution of a program in a multi-core or multiprocessor system.
Recently, a so-called multiprocessor system having multiple processors has been used in the fields of scientific computation, simulation and the like. In such a system, an application program generates multiple processes and allocates the processes to individual processors. These processors go through a procedure while communicating with each other using a shared memory space, for example.
As a field of simulation that has developed particularly rapidly in recent years, there is simulation software for mechatronics plants such as robots, automobiles and airplanes. With the benefit of advances in electronic components and software technology, most parts of a robot, an automobile, an airplane or the like are electronically controlled by using wire connections laid like a network of nerves, a wireless LAN and the like.
Although these mechatronics products are mechanical devices in nature, they also incorporate large amounts of control software. Therefore, the development of such a product has required a long time period, enormous costs and a large pool of manpower to develop a control program and to test the program.
As a conventional technique for such a test, there is HILS (Hardware In the Loop Simulation). In particular, an environment for testing all the electronic control units (ECUs) in an automobile is called full-vehicle HILS. In full-vehicle HILS, a test is conducted in a laboratory according to a predetermined scenario by connecting a real ECU to a dedicated hardware device emulating an engine, a transmission mechanism, or the like. The output from the ECU is input to a monitoring computer and further shown on a display, allowing the person in charge of the test to check for any abnormal action while viewing the display.
However, HILS uses the dedicated hardware device, which must be physically wired to the real ECU. Thus, HILS involves a lot of preparation. Further, when a test is conducted with a different ECU, the device and the ECU have to be physically reconnected, requiring even more work. Further, since the test uses the real ECU, the test runs in real time, so testing many scenarios takes an immense amount of time. In addition, the hardware device for emulation in HILS is generally very expensive.
Therefore, a technique has recently emerged that uses software instead of such an expensive emulation hardware device. This technique is called SILS (Software In the Loop Simulation), in which the components to be mounted in the ECU, such as a microcomputer and an I/O circuit, a control scenario, and all plants such as an engine and a transmission, are configured by using a software simulator. This enables the test to be conducted without the hardware of the ECU.
As a system for supporting such a configuration of SILS, for example, there is a simulation modeling system, MATLAB®/Simulink® available from Mathworks Inc. In the case of using MATLAB®/Simulink®, functional blocks indicated by rectangles are arranged on a screen through a graphical interface as shown in
Thus, when the block diagram of the functional blocks or the like is created on MATLAB®/Simulink®, it can be converted into functionally equivalent C source code using the function of Real-Time Workshop®. This C source code is then compiled so that the simulation can be performed as SILS on another computer system.
Therefore, as shown in
In the meantime, techniques for allocating multiple tasks or processes to respective processors to parallelize the processes in a multiprocessor system are described in the following documents.
Japanese Patent Application Publication No. 9-97243 aims to shorten the turnaround time of a program composed of parallel tasks in a multiprocessor system. In the disclosed system, a source program of a program composed of parallel tasks is compiled by a compiler to generate a target program. The compiler generates an inter-task communication amount table holding the amount of data of inter-task communication performed between the parallel tasks. From the inter-task communication amount table and a processor communication cost table defining the data communication time per unit of data between every pair of processors in the multiprocessor system, a task scheduler decides, and registers in a processor control table, the allocation in which each task of the parallel tasks is given the processor that makes its inter-task communication time shortest.
Japanese Patent Application Publication No. 9-167144 discloses a program creation method for altering a parallel program in which plural kinds of operation procedures and plural kinds of communication procedures corresponding to communication processing among processors are described to perform parallel processing. When the communication amount of the communication processing performed according to a currently used communication procedure is assumed to be increased, if the time from the start of the parallel processing until the end thereof is thereby shortened, the communication procedures in the parallel program are rearranged, changing the description content to merge two or more communication procedures.
Japanese Patent Application Publication No. 2007-048052 relates to a compiler for optimizing parallel processing. The compiler records the number of execution cores as the number of processor cores for executing a target program. First, the compiler detects dominant paths as candidates for execution paths to be continuously executed by a single processor core in the target program. Next, the compiler selects a number of dominant paths equal to or smaller than the number of execution cores to generate clusters of tasks to be executed in parallel or continuously by a multi-core processor. Next, for each natural number equal to or smaller than the number of execution cores, the compiler calculates the execution time taken when that number of processor cores executes the generated clusters on a cluster basis. Then, based on the calculated execution times, the compiler selects the number of processor cores to be allocated to execute each cluster.
However, these disclosed techniques cannot always achieve efficient parallelization when processing a directed graph such as the one shown in
On the other hand, a technique adapted to the parallelization of clusters shown in
It is an object of this invention to provide a parallelization technique capable of taking advantage of parallelism in strongly-connected components and enabling a high-speed operation in such a simulation model that tends to increase the size of the strongly-connected components.
As a precondition of carrying out this invention, it is assumed that the system is in a multi-core or multiprocessor environment. In such a system, a program for parallelization is created by, but should not be limited to, a simulation modeling tool such as MATLAB®/Simulink®. In other words, the program is described with control blocks connected by directed edges indicating a flow of processes.
The first step according to the present invention is to select highly predictable edges from the edges.
In the next step, a processing program according to the present invention finds strongly-connected clusters. After that, strongly-connected clusters each including only one block and adjacent to each other are merged in a manner not to impede parallelization and the merged cluster is set as a non-strongly connected cluster.
In the next step, the processing program according to the present invention creates a parallelization table for each of the formed strongly-connected clusters and non-strongly connected clusters.
In the next step, the processing program according to the present invention converts, into a series-parallel graph, a graph having strongly-connected clusters and non-strongly connected clusters as nodes.
In the next step, the processing program according to the present invention merges parallelization tables based on the hierarchy of the series-parallel graph.
In the next step, the processing program according to the present invention selects the best configuration from the parallelization tables obtained, and based on this configuration, clusters are actually allocated to cores or processors, individually.
According to this invention, a parallelization technique is used, which takes advantage of parallelism of strongly-connected components in such a simulation model that tends to increase the size of the strongly-connected components, thereby increasing the operation speed.
A configuration and processing of one preferred embodiment of the present invention will now be described with reference to the accompanying drawings. In the following description, the same components are denoted by the same reference numerals throughout the drawings unless otherwise noted. Although the configuration and processing are described here as one preferred embodiment, it should be understood that the technical scope of the present invention is not intended to be limited to this embodiment.
First, the hardware of a computer used to carry out the present invention will be described with reference to
A keyboard 410, a mouse 412, a display 414 and a hard disk drive 416 are connected to an I/O bus 408. The I/O bus 408 is connected to the host bus 402 through an I/O bridge 418. The keyboard 410 and the mouse 412 are used by an operator to perform operations, such as to enter a command and click on a menu. The display 414 is used to display a menu on a GUI to operate, as required, a program according to the present invention to be described later.
IBM® System X can be used as the hardware of a computer system suitable for this purpose. In this case, for example, Intel® Xeon® may be used for CPU1 404a, CPU2 404b, CPU3 404c, . . . , CPUn 404n, and the operating system may be Windows® Server 2003. The operating system is stored in the hard disk drive 416, and read from the hard disk drive 416 into the main memory 406 upon startup of the computer system.
Use of a multiprocessor system is required to carry out the present invention. Here, a multiprocessor system generally means a system having one or more processors with multiple processor cores capable of performing arithmetic processing independently. It should be appreciated that the multiprocessor system may be any of a multi-core single-processor system, a single-core multiprocessor system and a multi-core multiprocessor system.
Note that the hardware of the computer system usable for carrying out the present invention is not limited to IBM® System X and any other computer system can be used as long as it can run a simulation program of the present invention. The operating system is also not limited to Windows®, and any other operating system such as Linux® or Mac OS® can be used. Further, a POWER™ 6-based computer system such as IBM® System P with operating system AIX™ may also be used to run the simulation program at high speed.
Also stored in the hard disk drive 416 are MATLAB®/Simulink®, a C compiler or C++ compiler, modules for analysis, flattening, clustering and unrolling according to the present invention to be described later, a code generation module for generating codes to be allocated to the CPUs, a module for measuring an expected execution time of a processing block, etc., and they are loaded to the main memory 406 and executed in response to a keyboard or mouse operation by the operator.
Note that a usable simulation modeling tool is not limited to MATLAB®/Simulink®, and any other simulation modeling tool such as open-source Scilab/Scicos can be employed.
Otherwise, in some cases, the source code of the simulation system can also be written directly in C or C++ without using the simulation modeling tool. In this case, the present invention is applicable as long as all the functions can be described as individual functional blocks dependent on each other.
In
The simulation modeling tool can also be installed on another personal computer so that source code generated there can be downloaded to the hard disk drive 416 via a network or the like.
The source code 504 thus output is stored in the hard disk drive 416.
An analysis module 506 receives the input of the source code 504, parses the source code 504 and converts the connections among the blocks into a graph representation 508. It is preferred to store data of the graph representation 508 in the hard disk drive 416.
A clustering module 510 reads the graph representation 508 to perform clustering by finding strongly-connected components (SCCs). The term "strongly-connected" means that there is a directed path between any two vertices in a directed graph. The term "strongly-connected component" means a maximal strongly-connected subgraph of a given graph: the subgraph itself is strongly-connected, and adding any further vertex would make it no longer strongly-connected.
A parallelization table processing module 514 has the function of creating a parallelization table 516 by processing to be described later based on the clusters obtained by the clustering module 510 performing clustering.
It is preferred that the created parallelization table 516 be placed in the main memory 406, but it may be placed in the hard disk drive 416.
A code generation module 518 refers to the graph representation 508 and the parallelization table 516 to generate source code to be compiled by a compiler 520. As the programming language assumed by the compiler 520, any programming language programmable in conformity to a multi-core or multiprocessor system, such as C, C++, C#, or Java™, can be used, and the code generation module 518 generates source code for each cluster according to the programming language.
An executable binary code (not shown) generated by the compiler 520 for each cluster is allocated to a different core or processor based on the content described in the parallelization table 516 or the like, and executed in an execution environment 522 by means of the operating system.
Processing of the present invention will be described in detail below according to a series of flowcharts, but before that, the definition of terms and notation will be given.
Set
X̄ represents the complement of the set X.
X−Y = X ∩ Ȳ
X[i] is the i-th element of set X.
MAX(X) is the largest value recorded in the set X.
FIRST(X) is the first element of the set X.
SECOND(X) is the second element of the set X.
Graph
Graph G is represented by <V, E>.
V is a set of nodes in the graph G.
E is a set of edges connecting vertices (nodes) in the graph G.
PARENT(v) is the set of parent nodes of node v (∈ V) in the graph G.
CHILD(v) is the set of child nodes of node v (∈ V) in the graph G.
SIBLING(v) is defined by {c : c != v, c ∈ CHILD(p), p ∈ PARENT(v)}.
With respect to edge e = (u, v), (u ∈ V, v ∈ V),
SRC(e):=u
DEST(e):=v
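For illustration, the graph notation above can be sketched with plain Python sets; the node names and edge list below are hypothetical examples, not taken from any figure of this specification.

```python
# Illustrative graph: V is the node set, E the directed edge set.
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}

def SRC(e):              # SRC(e) := u for e = (u, v)
    return e[0]

def DEST(e):             # DEST(e) := v for e = (u, v)
    return e[1]

def PARENT(v, edges=E):  # set of parent nodes of v
    return {u for (u, w) in edges if w == v}

def CHILD(v, edges=E):   # set of child nodes of v
    return {w for (u, w) in edges if u == v}

def SIBLING(v, edges=E): # {c : c != v, c in CHILD(p), p in PARENT(v)}
    return {c for p in PARENT(v, edges)
              for c in CHILD(p, edges) if c != v}

print(sorted(SIBLING("b")))  # "c" shares parent "a" with "b"
```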
Cluster
Cluster means a set of blocks. An SCC is also a set of blocks, and is a kind of cluster.
WORKLOAD(C) is the workload of cluster C. The workload of the cluster C is calculated by summing the workloads of all the blocks in the cluster C.
START(C) represents the starting time of the cluster C when static scheduling is performed on a set of clusters including the cluster C.
END(C) represents the ending time of the cluster C when static scheduling is performed on the set of clusters including the cluster C.
Parallelization Table T
T is a set of entries I as shown below.
I:=<number of processors, length of schedule (also referred to as cost and/or workload), set of clusters>
ENTRY(T, i) is the entry of the parallelization table T whose first element is i.
LENGTH(T, i) is the second element of the entry of T whose first element is i. If no such entry exists, ∞ is returned.
CLUSTERS(T, i) is the set of clusters recorded in the entry of T whose first element (the number of processors) is i.
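As a minimal sketch, a parallelization table T can be held as a dictionary keyed by the number of processors, with each entry following the layout I = <number of processors, length of schedule, set of clusters>; the entry values and block names below are hypothetical.

```python
import math

# Hypothetical table: with 1 processor the schedule is 100 time units long;
# with 2 processors it shortens to 60.
T = {
    1: (1, 100, [{"b1", "b2", "b3"}]),
    2: (2, 60, [{"b1", "b2"}, {"b3"}]),
}

def ENTRY(T, i):                   # entry whose first element is i
    return T.get(i)

def LENGTH(T, i):                  # schedule length, or infinity if absent
    return T[i][1] if i in T else math.inf

def CLUSTERS(T, i):                # set of clusters recorded for i processors
    return T[i][2] if i in T else None

print(LENGTH(T, 3))                # no 3-processor entry recorded
```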
Series-Parallel Graph
A series-parallel nested tree Gsp-tree is a binary tree represented by <Vsp-tree, Esp-tree>.
Vsp-tree represents the set of nodes of Gsp-tree, in which each node is a pair (f, s) of an edge and a symbol. Here, f ∈ Ept-sp (where Ept-sp is a set whose elements are edges in a graph) and s ∈ {"L", "S", "P"}.
"L" is a symbol representing a leaf node, "S" a series node, and "P" a parallel node.
Esp-tree is a set of edges (u, v) of the tree Gsp-tree.
EDGE(n) (n ∈ Vsp-tree) is the first element of n.
SIGN(n) (n ∈ Vsp-tree) is the second element of n.
LEFT(n) (n ∈ Vsp-tree) is the left child node of node n in the tree Gsp-tree.
RIGHT(n) (n ∈ Vsp-tree) is the right child node of node n in the tree Gsp-tree.
Referring to
First, this graph is represented by G:=<V, E>, where V is a set of blocks and E is a set of edges.
Returning to
The graph representation after the predictable edges are thus removed is represented as Gpred:=<Vpred, Epred>. In this case, Vpred=V and Epred=E−Set of predictable edges.
A predictable edge is an edge (a signal on the block diagram) that generally indicates a continuously varying quantity, such as the speed of an object, and thus shows no acute change in a short time. Typically, the model creator can be asked to write an annotation on the model so that the compiler can know which edges are predictable.
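Under the assumption that predictable edges carry a model-creator annotation as described above, removing them can be sketched as a simple filter; the edge names and the "predictable" flag below are hypothetical.

```python
# Hypothetical annotated edge set: each edge maps to its annotations.
E = {
    ("engine", "speed_sensor"): {"predictable": True},   # slowly varying signal
    ("controller", "actuator"): {"predictable": False},
}

# Epred = E minus the set of predictable edges (Vpred = V is unchanged).
E_pred = {e for e, attrs in E.items() if not attrs["predictable"]}

print(E_pred)
```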
In step 604, the clustering module 510 detects strongly-connected components (SCCs). In
Using the SCCs thus detected, the graph of SCCs is represented as
GSCC:=<VSCC, ESCC>.
Here, VSCC is a set of SCCs created by this algorithm, and
ESCC is a set of edges connecting SCCs in VSCC.
Here, Vloop, a set of the SCCs whose nodes form a loop (i.e., SCCs each including two or more blocks), is also created.
In step 606, adjacent SCCs each including only one block are merged by the clustering module 510 to form a non-SCC cluster so as not to impede subsequent parallelization. This situation is shown in
The graph thus merged is represented as Garea:=<Varea, Earea>.
Here, Varea is a set of non-SCC clusters newly formed as a result of merging by this algorithm and SCC clusters without any change in this algorithm, and
Earea is a set of edges connecting elements of the Varea.
Here, Vnon-loop, a set of the newly formed non-SCC clusters, is also created.
In step 608, the parallelization table processing module 514 calculates a parallelization table for each cluster in Vloop. Thus, a set Vpt-loop of parallelization tables can be obtained.
In step 610, the parallelization table processing module 514 calculates a parallelization table for each cluster in Vnon-loop. Thus, a set Vpt-non-loop of parallelization tables can be obtained.
The parallelization tables thus obtained are shown in
In step 612, the parallelization table processing module 514 constructs a graph in which each parallelization table is taken as a node.
The graph thus constructed is represented as Gpt:=<Vpt, Ept>.
Here, Vpt is a set of parallelization tables created by this algorithm, and
Ept is a set of edges connecting elements of the Vpt.
In step 614, the parallelization table processing module 514 unifies the parallelization tables in the Vpt. In this unification processing, the Gpt is first converted into a series-parallel graph and a series-parallel nested tree is generated therefrom. An example of the series-parallel nested tree generated here is shown at 1202 in
An example of the unified parallelization table Tunified is shown in
The parallelization table processing module 514 selects the best configuration from the unified parallelization table Tunified. As a result, a resulting set of clusters Rfinal can be obtained. In the example of
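Assuming, for illustration, that the best configuration is simply the entry with the shortest schedule length, the selection from the unified table can be sketched as follows; the table entries are hypothetical.

```python
# Hypothetical unified parallelization table Tunified, keyed by processor count:
# entry = (number of processors, schedule length, set of clusters).
T_unified = {
    1: (1, 120, ["cluster layout for 1 CPU"]),
    2: (2, 70, ["cluster layout for 2 CPUs"]),
    4: (4, 55, ["cluster layout for 4 CPUs"]),
}

# Pick the processor count whose schedule length is shortest.
best_i = min(T_unified, key=lambda i: T_unified[i][1])
R_final = T_unified[best_i][2]     # resulting set of clusters Rfinal

print(best_i, T_unified[best_i][1])
```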
The following describes each step of the general flowchart in
As shown, in step 1502, the following processing is performed:
An SCC algorithm is applied to the Gpred. For example, this SCC algorithm is described in “Depth-first search and linear graph algorithms,” R. Tarjan, SIAM Journal on Computing, pp. 146-160, 1972.
VSCC=Set of SCCs obtained by the algorithm
ESCC = {(C, C′) : C ∈ VSCC, C′ ∈ VSCC, C != C′, ∃(u, v) ∈ Epred, u ∈ C, v ∈ C′}
GSCC=<VSCC, ESCC>
Vloop = {C : C ∈ VSCC, |C| > 1}
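Step 1502 can be sketched with Tarjan's algorithm, the method cited above, on a small hypothetical graph; this is a minimal recursive version for illustration, not the implementation of the embodiment. Blocks 1 to 3 form a loop and block 4 does not.

```python
def tarjan_scc(nodes, edges):
    """Return the strongly-connected components of a directed graph."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for (u, w) in edges:
            if u != v:
                continue
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in nodes:
        if v not in index:
            visit(v)
    return sccs

V_scc = tarjan_scc({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 1), (3, 4)})
V_loop = [C for C in V_scc if len(C) > 1]   # Vloop = {C : |C| > 1}
print(V_scc, V_loop)
```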
In step 1602, variables are set as follows:
S=stack, T=Empty map between SCC and new cluster
Varea=Empty set of new clusters.
In step 1604, it is determined whether all elements of H have been processed, and if not, the procedure proceeds to step 1606 in which one of unprocessed SCCs in H is extracted and set as C.
In step 1608, it is determined whether C ∈ V̄loop, and if so, the procedure proceeds to step 1610 in which all elements in {C′ : C′ ∈ CHILD(C) ∩ V̄loop} are put into S.
Here, V̄loop is the complement of Vloop, with the VSCC taken as the universal set.
Next, the procedure proceeds to step 1612 in which a new empty cluster Cnew is created and the Cnew is added to Varea.
Returning to step 1608, if C is not in V̄loop, C is put into S in step 1614, and the procedure proceeds to step 1612.
In step 1616, it is determined whether |S|=0, and if so, the procedure returns to step 1604.
If it is determined in step 1616 that it is not |S|=0, the procedure proceeds to step 1618 in which the following processing is performed:
Extract C from S
Put (C, Cnew) into T
F=CHILD(C)
Next, the procedure proceeds to step 1620 in which it is determined whether |F|=0, and if so, the procedure returns to step 1616.
If it is determined in step 1620 that it is not |F|=0, the procedure proceeds to step 1622 in which processing for acquiring one element Cchild from F is performed.
Next, in step 1624, it is determined whether Cchild ∈ H, and if so, the procedure returns to step 1620.
If it is determined in step 1624 that it is not Cchild ∈ H, it is determined in step 1626 whether |{(Cchild, C′) ∈ T : C′ ∈ Varea}| = 0, and if so, Cchild is put into S in step 1628, and after that, the procedure returns to step 1620.
If it is determined in step 1626 that it is not |{(Cchild, C′) ∈ T : C′ ∈ Varea}| = 0, it is determined in step 1630 whether C′ == Cnew, and if so, the procedure returns to step 1620.
If it is determined in step 1630 that it is not C′==Cnew, a function as Clear_path_and_assign (Cchild, T) is called in step 1632, and the procedure returns to step 1620.
The details of Clear_path_and_assign (Cchild, T) will be described later.
Returning to step 1604, if it is determined that all elements C in H have been processed, the procedure proceeds to step 1634 to end the processing after performing the following:
Put all blocks in C into Cnew for every element (C, Cnew) in T
Earea = {(C, C′) : C ∈ Varea, C′ ∈ Varea, C != C′, ∃(u, v) ∈ Epred, u ∈ C, v ∈ C′}
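A greatly simplified sketch of step 606 follows: adjacent single-block SCCs are merged into non-SCC clusters by union-find over the edges joining them. This illustrative version omits the path-clearing checks of steps 1626 to 1632 (Clear_path_and_assign), which prevent merges that would impede parallelization, so it is a sketch of the idea only.

```python
def merge_single_block_sccs(sccs, edges):
    """Merge adjacent single-block SCCs; loop SCCs are left untouched."""
    single = [frozenset(c) for c in sccs if len(c) == 1]
    parent = {c: c for c in single}      # union-find forest

    def find(c):                         # find with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (c1, c2) in edges:               # union endpoints of qualifying edges
        a, b = frozenset(c1), frozenset(c2)
        if a in parent and b in parent:  # both are single-block SCCs
            parent[find(a)] = find(b)

    merged = {}
    for c in single:                     # collect blocks per merged cluster
        merged.setdefault(find(c), set()).update(c)
    return list(merged.values())

# "x" and "y" are adjacent single-block SCCs; {1, 2} is a loop SCC and
# therefore is not merged into the new non-SCC cluster.
V_area = merge_single_block_sccs([{"x"}, {"y"}, {1, 2}],
                                 [({"x"}, {"y"}), ({"y"}, {1, 2})])
print(V_area)
```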
In step 1702, the following is set up:
Put Cchild into S1.
Find, from T, an element (Cchild, Cprev
Create a new empty cluster Cnew.
Put Cnew into Varea.
In step 1704, it is determined whether |S1|=0, and if so, the processing is ended.
If it is determined in step 1704 that it is not |S1|=0, the following processing is performed in step 1706:
Extract C from S1.
Remove, from T, the element (C, X) whose first element is C, where X ∈ Varea.
In step 1708, it is determined whether |F1|=0, and if so, the procedure returns to step 1704, while if not, the procedure proceeds to step 1710 in which processing for acquiring Cgc from F1 is performed.
Next, the procedure proceeds to step 1712 in which it is determined whether Cgc ∈ H, and if so, the procedure returns to step 1708.
If it is determined in step 1712 that it is not CgcεH, an element (Cgc, Cgca) whose first element is Cgc is found from T in step 1716, and in the next step 1718, it is determined whether Cprev
Referring next to a flowchart of
In
In step 1804, it is determined whether |Vloop|=0, and if so, this processing is ended.
In the next step 1806, the following processing is performed:
i=1
Obtain cluster C from Vloop.
L = {(u, v) : u ∈ C, v ∈ C, (u, v) ∈ Epred}
Tc = new parallelization table with 0 entries
Here, Gtmp = <C, L> denotes the graph whose nodes are the blocks included in C and whose edges are those included in L.
In step 1808, it is determined whether i<=m, and if not, Tc is put into the Vpt-loop in step 1810 and the procedure returns to step 1804.
If it is determined in step 1808 that i<=m, the procedure proceeds to step 1812 in which S = {s : s ∈ C, |PARENT(s) ∩ C̄| > 0} is set, i.e., the set of blocks in C having a parent outside C.
In the next step 1814, it is determined whether |S|=0, and if so, i is incremented by one and the procedure returns to step 1808.
If it is determined in step 1814 that it is not |S|=0, s is obtained from S in step 1818, and in step 1820, processing for detecting a set of back edges from the Gtmp is performed. This is done, with s taken as the entry node of the Gtmp, by a method, for example, as described in the following document: Alfred V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman, "Compilers: Principles, Techniques, and Tools (2nd Edition)", Addison Wesley.
Here, the detected set of back edges is put as B.
Then, Gc = <C, L−B> is set.
In step 1822, processing for clustering blocks in C into i clusters is performed. This is done, on condition that the number of available processors is i, by applying, to Gc, a multiprocessor scheduling method, for example, as described in the following document: Sih G. C., and Lee E. A. "A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures," IEEE Trans. Parallel Distrib. Syst. 4, 2 (Feb. 1993), 75-87. As a result of such scheduling, each block is assigned to one of the processors, and the set of blocks to be executed by one processor is set as one cluster.
Then, the resulting set of clusters (i clusters) is put as R, and the schedule length resulting from Gc is put as t.
Here, the schedule length means time required from the start of the processing until the completion thereof as a result of the above scheduling.
At this time, the starting time of processing for a block to be first executed as a result of the above scheduling is set to 0, and the starting time and ending time of each cluster are recorded as the time at which processing for the first block is performed on a processor corresponding to the cluster and the time at which processing for the last block is ended, respectively, keeping them referable.
In step 1824, it is set as t′=LENGTH(Tc, i), and the procedure proceeds to step 1826 in which it is determined whether t<t′. If so, the entry (i, t, R) is put into Tc in step 1828 and the procedure returns to step 1814. If not, the procedure returns directly to step 1814.
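The loop of steps 1806 to 1828 can be sketched as follows. The scheduler below is a hypothetical stand-in for the cited Sih-Lee list-scheduling heuristic: it simply assigns each block, longest first, to the least-loaded of i processors, and it ignores back-edge removal and communication costs; only the surrounding table-building loop follows the flowchart.

```python
def schedule(workloads, i):
    """Greedy stand-in scheduler: returns (schedule length t, clusters R)."""
    loads = [0] * i
    clusters = [set() for _ in range(i)]
    for block, w in sorted(workloads.items(), key=lambda kv: -kv[1]):
        p = loads.index(min(loads))      # least-loaded processor
        loads[p] += w
        clusters[p].add(block)
    return max(loads), clusters

def build_table(workloads, m):
    """For i = 1..m, keep the entry (i, t, R) with the shortest length t."""
    Tc = {}
    for i in range(1, m + 1):
        t, R = schedule(workloads, i)
        if t < Tc.get(i, (i, float("inf"), None))[1]:  # t < t' = LENGTH(Tc, i)
            Tc[i] = (i, t, R)
    return Tc

# Hypothetical cluster with three blocks and per-block workloads.
Tc = build_table({"b1": 4, "b2": 3, "b3": 3}, 2)
print(Tc[1][1], Tc[2][1])
```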
Referring next to a flowchart of
In
In step 1904, it is determined whether |Vnon-loop|=0, and if so, this processing is ended.
If it is determined in step 1904 that it is not |Vnon-loop|=0, i is set to 1 in step 1906, cluster C is acquired from the Vnon-loop, and a new parallelization table with 0 entries is set as Tc.
In step 1908, it is determined whether i<=m, and if not, the procedure proceeds to step 1910 in which Tc is put into Vpt-non-loop and the procedure returns to step 1904.
If it is determined in step 1908 that i<=m, processing for clustering nodes in C into i clusters is performed in step 1912. This is done, on condition that the number of available processors is i, by applying, to Gc, a multiprocessor scheduling method, for example, as described in the following document: G. Ottoni, R. Rangan, A. Stoler, and D. I. August, “Automatic Thread Extraction with Decoupled Software Pipelining,” In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, November 2005.
Then, the resulting set consisting of i clusters is set as R, t is set as MAX_WORKLOAD(R), (i, t, R) is put into Tc, i is incremented by one, and the procedure returns to step 1908. At this time, the starting time of processing for the block to be executed first as a result of the above scheduling is set to 0, and the starting time and ending time of each cluster are recorded, respectively, as the time at which processing for the first block on the processor corresponding to the cluster starts and the time at which processing for the last block ends, keeping them referable.
Next, the set of edges of the graph consisting of the parallelization tables is given by the following equation:
Ept := {(T, T′) : T ∈ Vpt, T′ ∈ Vpt, T != T′, ∃(u, v) ∈ Epred, u ∈ FIRST(CLUSTERS(T, 1)), v ∈ FIRST(CLUSTERS(T′, 1))}
As mentioned above, the graph consisting of the parallelization tables is constructed by Gpt:=<Vpt, Ept>. Note that CLUSTERS (T, 1) always returns one cluster. This is because the number of available processors is one as shown in the second argument.
In addition, edges having the same pair of end points are merged.
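Building Ept and merging edges with the same pair of end points can be sketched as follows; collecting the pairs in a set merges the duplicates automatically. The table names and block names are hypothetical.

```python
# FIRST(CLUSTERS(T, 1)) for each parallelization table: the single cluster
# obtained when only one processor is available (hypothetical contents).
cluster_of = {
    "T1": {"a", "b"},
    "T2": {"c"},
}

# Two edges of Epred cross from T1's cluster to T2's cluster.
E_pred = [("a", "c"), ("b", "c")]

def table_of(block):
    """Return the table whose one-processor cluster contains the block."""
    return next(t for t, c in cluster_of.items() if block in c)

# A set merges the two parallel edges (T1, T2) into one.
E_pt = {(table_of(u), table_of(v))
        for (u, v) in E_pred
        if table_of(u) != table_of(v)}

print(E_pt)
```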
Referring next to a flowchart of
First, in step 2102, processing for converting Gpt into a series-parallel graph Gpt-sp=<Vpt-sp, Ept-sp> is performed. This is done by a method, for example, as described in the following document: Arturo Gonzalez Escribano, Valentin Cardenoso Payo, and Arjan J. C. van Gemund, “Conversion from NSP to SP graphs,” Tech. Rep. TRDINFO-01-97, Universidad de Valladolid, Valladolid (Spain), 1997.
Next, Vpt-sp is obtained as Vpt-sp = Vpt ∪ Vdummy.
Here, Vdummy is a set of dummy nodes added by this algorithm. Each dummy node is a parallelization table {(i, 0, φ) : i = 1, . . . , m}, where m is the number of processors available in the target system.
Further, Ept-sp is obtained as follows:
Here, Edummy is a set of dummy edges added by this algorithm to connect elements of the Vpt-sp.
In step 2104, Gsp-tree is obtained by the following equation:
Gsp-tree := get_series_parallel_nested_tree(Gpt-sp)
Note that the function called get_series_parallel_nested_tree ( ) will be described in detail later.
In step 2106, nroot := Root node of Gsp-tree is set. This root node is a node having no parent node, and exactly one such node exists in the Gsp-tree.
Next, Tunified is obtained by the following equation:
Tunified := get_table(nroot)
Note that the function called get_table ( ) will be described in detail later.
Referring next to a flowchart of
First, in step 2202, copies are once made as Vcpy=Vpt-sp, Ecpy=Ept-sp.
In step 2204, the set is updated by Scand = {T : T ∈ Vcpy, |{e = (T′, T) : e ∈ Ecpy}| = 1 and |{e = (T, T″) : e ∈ Ecpy}| = 1}.
In step 2206, it is determined whether |Scand|=0, and if so, Gsp-tree:=<Vsp-tree, Esp-tree> is set and processing is ended.
If it is determined in step 2206 that it is not |Scand|=0, the procedure proceeds to step 2210 to perform the following processing:
First, acquire T from Scand.
f := (T′, T), f′ := (T, T″), where f and f′ are the unique incoming and outgoing edges of T in Ecpy.
Create new edge f″=(T′, T″).
nsnew=(f″, “S”)
Put nsnew into Vsp-tree.
Next, the procedure proceeds to step 2212 in which it is determined whether f is a newly created edge. If so, the procedure proceeds to step 2214 in which processing for finding, from the Vsp-tree, the node n such that FIRST(n) = f is performed.
On the other hand, if it is determined in step 2212 that f is not a newly created edge, the procedure proceeds to step 2216 to create new tree node n=(f, “L”) and put n into the Vsp-tree.
From step 2214 or 2216, the procedure proceeds to step 2218 in which processing for putting (nsnew, n) into the Esp-tree is performed.
Next, the procedure proceeds to step 2220 in which it is determined whether f′ is a newly created edge. If so, the procedure proceeds to step 2222 in which processing for finding, from the Vsp-tree, the node n′ such that FIRST(n′) = f′ is performed.
On the other hand, if it is determined in step 2220 that f′ is not a newly created edge, the procedure proceeds to step 2224 to create new tree node n′=(f′, “L”) and put n′ into the Vsp-tree.
From step 2222 or 2224, the procedure proceeds to step 2226 in which processing for putting (nsnew, n′) into the Esp-tree is performed. Further, P = {p = (T′, T″) : p ∈ Ecpy} is set.
Next, in step 2228, it is determined whether |P|=0, and if so, the procedure proceeds to step 2230 in which f″ is put into the Vcpy. Then, in the next step 2232, T is removed from the Vcpy, f′ and f″ are removed from the Ecpy, and the procedure returns to step 2204.
If it is determined in step 2228 that |P|≠0, the procedure proceeds to step 2234, in which one element p is acquired from P.
Next, in step 2236, it is determined whether p is a newly created edge, and if so, processing for finding node r as FIRST(r)=p from the Vsp-tree is performed in step 2238.
In step 2236, if it is determined that p is not a newly created edge, the procedure proceeds to step 2240 in which processing for creating new tree node r=(p, “L”) and putting r into the Vsp-tree is performed.
From step 2238 or step 2240, the procedure proceeds to step 2242, in which new edge f‴=(T′, T″) is created, npnew=(f‴, "P") is set, (npnew, nsnew) and (npnew, r) are put into the Esp-tree, p is removed from the Ecpy, and f‴ is put into the Ecpy.
From step 2242, the procedure returns to step 2204 via step 2232 already described above.
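The series-reduction loop above (steps 2202 through 2242) can be sketched in simplified form. The code below is a hypothetical illustration, not the patented procedure itself: it performs only the edge contraction on the graph copies, and omits both the construction of the series-parallel tree nodes (the "S", "L" and "P" records) and the handling of parallel edges via the set P.

```python
# Hypothetical sketch of the series-reduction loop (steps 2202-2242).
# A graph is given as a set of nodes and a set of directed edges; a node
# with exactly one incoming and one outgoing edge is contracted, and its
# two edges are replaced by a single combined edge.
def series_reduce(nodes, edges):
    nodes, edges = set(nodes), set(edges)          # step 2202: copies
    while True:
        # step 2204: candidates with in-degree 1 and out-degree 1
        cand = [t for t in nodes
                if sum(1 for (u, v) in edges if v == t) == 1
                and sum(1 for (u, v) in edges if u == t) == 1]
        if not cand:                               # step 2206: done
            return nodes, edges
        t = cand[0]                                # step 2210: pick T
        (t_in,) = [u for (u, v) in edges if v == t]
        (t_out,) = [v for (u, v) in edges if u == t]
        # contract (t_in, t) and (t, t_out) into one edge (step 2232)
        edges -= {(t_in, t), (t, t_out)}
        nodes.discard(t)
        edges.add((t_in, t_out))
```

A chain a→b→c→d, for example, reduces to the single edge (a, d).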
In
If it is determined in step 2302 that SIGN(n)="L," the procedure proceeds to step 2304 in which Tc=NULL is set. Then, in step 2306, Tc is returned, and the processing is ended.
If it is determined in step 2302 that it is not SIGN(n)="L," the procedure proceeds to step 2308 in which l=LEFT(n), r=RIGHT(n), Tl=get_table(l) and Tr=get_table(r) are calculated. Since this flowchart itself describes the processing of get_table( ), the calls get_table(l) and get_table(r) are recursive calls.
Next, the procedure proceeds to step 2310 in which it is determined whether SIGN(n)="S." If not, Tc=parallel_merge(Tl, Tr) is set in step 2312, Tc is returned in step 2306, and the processing is ended. The details of parallel_merge( ) will be described later.
If it is determined in step 2310 that SIGN(n)="S," el=EDGE(l) and Tc=DEST(el) are set in step 2314, and it is determined in step 2316 whether Tl=NULL. If not, Tc=series_merge(Tl, Tc) is set in step 2318, and the procedure proceeds to step 2320. If so, the procedure proceeds directly to step 2320. The details of series_merge( ) will be described later.
Next, it is determined in step 2320 whether Tr=NULL, and if not, Tc=series_merge (Tc, Tr) is set in step 2322, and the procedure proceeds to step 2306. If so, the procedure proceeds directly to step 2306. Thus, Tc is returned and the processing is ended.
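The recursion of get_table( ) described above can be sketched as follows. This is a hypothetical, simplified illustration: tree nodes are encoded as tuples, tables are reduced to plain numbers (schedule lengths), and series_merge, parallel_merge and DEST are injected as stand-in functions so that only the recursion shape is shown.

```python
# Hypothetical, simplified sketch of the get_table() recursion
# (steps 2302-2322). A tree node is ("L", value) for a leaf, or
# (sign, left, right) with sign "S" or "P".
def get_table(n, series_merge, parallel_merge, dest):
    if n[0] == "L":                    # steps 2302-2306: leaf, no table
        return None
    _, l, r = n
    Tl = get_table(l, series_merge, parallel_merge, dest)   # step 2308
    Tr = get_table(r, series_merge, parallel_merge, dest)   # (recursive)
    if n[0] != "S":                    # steps 2310-2312: parallel node
        return parallel_merge(Tl, Tr)
    Tc = dest(l)                       # step 2314: series node
    if Tl is not None:
        Tc = series_merge(Tl, Tc)      # step 2318
    if Tr is not None:
        Tc = series_merge(Tc, Tr)      # step 2322
    return Tc
```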
Referring next to a flowchart of
If it is determined in step 2402 that Tl==NULL, the procedure proceeds to step 2410 in which it is determined whether Tr==NULL. If not, Tnew=Tr is set in step 2412, Tnew is returned in step 2408, and the processing is ended.
If Tr==NULL, the procedure proceeds to step 2414 in which Tnew=NULL is set, Tnew is returned in step 2408, and the processing is ended.
If it is determined in step 2402 to be neither Tl==NULL nor Tr==NULL, the procedure proceeds to step 2416 in which the number of available processors is set to m, and a new empty parallelization table is set to Tnew.
Then, in step 2417, i is set to 1, and it is determined in step 2418 whether i<=m. If not, the procedure proceeds to step 2408 to return Tnew and end the processing.
If i<=m, j=1 is set in step 2420. Then, in step 2422, it is determined whether j<=m, and if not, i is incremented by one in step 2424 and the procedure returns to step 2418.
If it is determined in step 2422 that j<=m, the procedure proceeds to step 2426 in which it is determined whether i+j<=m. If so, the procedure proceeds to step 2428 in which the following processing is performed:
lsl=LENGTH (Tl, i)
lsr=LENGTH (Tr, j)
ls=MAX (lsl, lsr)
Following step 2428, it is determined in step 2430 whether ls<LENGTH (Tnew, i+j), and if so, (i+j, ls, Rnew) is recorded in Tnew in step 2432. Then, the procedure proceeds to step 2434. If it is determined in step 2430 that it is not ls<LENGTH (Tnew, i+j), the procedure proceeds directly to step 2434.
In step 2434, it is determined whether i=j, and if so, the following processing is performed in step 2436:
(Rnew, ls)=merge_clusters_in_shared (Rl, Rr, i)
Note that processing for merge_clusters_in_shared ( ) will be described in detail later.
Following step 2436, it is determined in step 2438 whether ls<LENGTH (Tnew, i), and if so, (i, ls, Rnew) is recorded in Tnew in step 2440. Then, the procedure proceeds to step 2442. If it is determined in step 2438 that it is not ls<LENGTH (Tnew, i), the procedure proceeds directly to step 2442.
If it is determined in step 2434 that it is not i=j, the procedure proceeds directly from step 2434 to step 2442 as well. In step 2442, j is incremented by one and the procedure returns to step 2422.
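The (i, j) enumeration of parallel_merge( ) described above can be sketched in simplified form. In this hypothetical illustration, a parallelization table is modeled as a dict mapping a processor count to a (schedule length, clusters) pair; the i=j case, in which the two sides share the same processors via merge_clusters_in_shared( ), is omitted for brevity.

```python
# Hypothetical sketch of parallel_merge() (steps 2402-2442), disjoint-
# processor case only: running the two sides on i and j separate
# processors costs max of the two lengths, using i+j processors total.
INF = float("inf")

def parallel_merge(Tl, Tr, m=4):
    if Tl is None:                      # steps 2402-2414: trivial cases
        return Tr
    if Tr is None:
        return Tl
    Tnew = {}                           # step 2416: new empty table
    for i in Tl:                        # steps 2418-2442: all (i, j)
        for j in Tr:
            if i + j <= m:              # step 2426
                # disjoint processors: length is the longer side (2428)
                ls = max(Tl[i][0], Tr[j][0])
                if ls < Tnew.get(i + j, (INF,))[0]:    # step 2430
                    Tnew[i + j] = (ls, Tl[i][1] + Tr[j][1])
    return Tnew
```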
Referring next to a flowchart of
If it is determined in step 2502 that Tl==NULL, the procedure proceeds to step 2510 in which it is determined whether Tr==NULL. If not, Tnew=Tr is set in step 2512, Tnew is returned in step 2508, and the processing is ended.
If Tr==NULL, the procedure proceeds to step 2514 in which Tnew=NULL is set. Then, Tnew is returned in step 2508, and the processing is ended.
If it is determined in step 2502 to be neither Tl==NULL nor Tr==NULL, the procedure proceeds to step 2516 in which the number of available processors is set to m, and a new empty parallelization table is set to Tnew.
Further, the following is set:
T1=series_merge (Tl, Tr)
T2=series_merge (Tr, Tl)
The description of series_merge is already made with reference to
In step 2518, i is set to 1, and in step 2520, it is determined whether i<=m. If not, the procedure goes to step 2508 to return Tnew and end the processing.
If i<=m, the procedure proceeds to step 2522 in which l1 and l2 are set by the following equation:
l1=LENGTH(T1, i)
l2=LENGTH(T2, i)
In step 2524, it is determined whether l1<l2, and if so, R=CLUSTERS(T1, i) is set and (i, l1, R) is recorded in Tnew in step 2526.
If it is not l1<l2, R=CLUSTERS(T2, i) is set and (i, l2, R) is recorded in Tnew in step 2528.
Next, i is incremented by one in step 2530 and the procedure returns to step 2520.
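The steps above, which try both series orders and keep the shorter result for each processor count, can be sketched as follows. This is a hypothetical illustration with series_merge injected as a function returning a table of the form {i: (schedule length, clusters)}; the function name best_series_order is an assumption, not a name used in the text.

```python
# Hypothetical sketch of steps 2502-2530: both series orders are tried
# and, for each processor count i, the shorter schedule is kept.
def best_series_order(Tl, Tr, series_merge, m=4):
    if Tl is None:                      # steps 2502-2514: trivial cases
        return Tr
    if Tr is None:
        return Tl
    T1 = series_merge(Tl, Tr)           # one order
    T2 = series_merge(Tr, Tl)           # the reverse order
    Tnew = {}
    for i in range(1, m + 1):           # steps 2518-2530
        c1, c2 = T1.get(i), T2.get(i)
        if c1 is None and c2 is None:
            continue
        if c2 is None or (c1 is not None and c1[0] < c2[0]):
            Tnew[i] = c1                # step 2526: l1 < l2
        else:
            Tnew[i] = c2                # step 2528
    return Tnew
```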
Referring next to a flowchart of
First, in step 2602, clusters in Rl are sorted by ending time in ascending order.
Clusters in Rr are also sorted by ending time in ascending order.
Next, index x is selected from 1 to i so as to maximize END(Rl[x])−START(Rr[x]).
Further, the following is calculated:
w=MAX({END(Rl[u])+gap[u]+WORKLOAD(Rr[u]):
gap[u]=END(Rl[x])−START(Rr[x])+START(Rr[u])−END(Rl[u]), u=1, . . . , i})
R:={Ru: Ru:=Rl[u]∪Rr[u], u=1, . . . , i}
In step 2604, (R, w) is returned, and the processing is ended.
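The computation above can be sketched as follows. This hypothetical illustration models a cluster as a (start, end, workload) triple; note that END(Rl[u])+gap[u]+WORKLOAD(Rr[u]) simplifies algebraically to START(Rr[u])+shift+WORKLOAD(Rr[u]), where shift is the maximized END(Rl[x])−START(Rr[x]).

```python
# Hypothetical sketch of merge_clusters_in_shared() (steps 2602-2604):
# the clusters of Rr run on the same i processors as those of Rl, with
# Rr delayed by the smallest shift that lets every Rr cluster start
# after its paired Rl cluster ends.
def merge_clusters_in_shared(Rl, Rr, i):
    Rl = sorted(Rl, key=lambda c: c[1])    # sort by ending time
    Rr = sorted(Rr, key=lambda c: c[1])
    # index x maximizing END(Rl[x]) - START(Rr[x]) gives the shift
    shift = max(Rl[x][1] - Rr[x][0] for x in range(i))
    # overall finishing time w after shifting every Rr cluster
    w = max(Rr[u][0] + shift + Rr[u][2] for u in range(i))
    R = [(Rl[u], Rr[u]) for u in range(i)]  # merged cluster pairs
    return R, w
```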
Referring next to a flowchart of
In step 2702, the number of available processors is set to m. Further, i=1 and min=∞ are set. In practice, ∞ is represented by a sufficiently large number.
In step 2704, it is determined whether i<=m, and if so, w=LENGTH(Tunified, i) is calculated in step 2706, and it is determined in step 2708 whether w<min.
If it is not w<min, the procedure returns to step 2704. If w<min, min=w is set in step 2710, Rfinal=CLUSTERS(Tunified, i) is calculated in step 2712, and the procedure returns to step 2704.
If it is determined in step 2704 that it is not i<=m, the processing is ended. The value of Rfinal at that point is the result to be obtained.
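The final selection loop can be sketched as follows. This is a hypothetical illustration: the table is modeled as a dict {i: (length, clusters)}, and the name pick_final is an assumption.

```python
# Hypothetical sketch of the final selection (steps 2702-2712): scan the
# unified table and keep the cluster set with the shortest schedule.
def pick_final(Tunified, m=4):
    best, Rfinal = float("inf"), None      # min starts at "infinity"
    for i in range(1, m + 1):              # steps 2704-2712
        if i not in Tunified:
            continue
        w, clusters = Tunified[i]          # LENGTH / CLUSTERS
        if w < best:
            best, Rfinal = w, clusters
    return Rfinal
```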
Returning to
The methodologies of embodiments of the invention may be particularly well-suited for use in an electronic device or alternative system. Accordingly, the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “processor”, “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code stored thereon.
Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be a computer readable storage medium. A computer readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
The present invention is described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a central processing unit (CPU) and/or other processing circuitry (e.g., digital signal processor (DSP), microprocessor, etc.). Additionally, it is to be understood that the term “processor” may refer to more than one processing device, and that various elements associated with a processing device may be shared by other processing devices. The term “memory” as used herein is intended to include memory and other computer-readable media associated with a processor or CPU, such as, for example, random access memory (RAM), read only memory (ROM), fixed storage media (e.g., a hard drive), removable storage media (e.g., a diskette), flash memory, etc. Furthermore, the term “I/O circuitry” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, etc.) for entering data to the processor, and/or one or more output devices (e.g., printer, monitor, etc.) for presenting the results associated with the processor.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While this invention has been described based on the specific embodiment, this invention is not limited to this specific embodiment. It should be understood that various configurations and techniques such as modifications and replacements, which would be readily apparent to those skilled in the art, are also applicable. For example, this invention is not limited to the architecture of a specific processor, the operating system and the like.
Further, the aforementioned embodiment is related primarily to parallelization in a simulation system for vehicle SILS, but this invention is not limited to this example. It should be understood that the invention is applicable to a wide variety of simulation systems for other physical systems such as airplanes and robots.
This application claims priority from Japanese Patent Application No. 2009-232369, filed October 2009 (JP, national).