This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-198907, filed on Dec. 7, 2021, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a non-transitory computer-readable recording medium storing a conversion program and a conversion method.
In the field of high performance computing (HPC), parallel programming for shared-memory type processors mainly uses a data parallel description by open multi-processing (OpenMP). In the data parallel description, a parallelizable loop is divided and allocated to each thread to be executed in parallel. In order to ensure computation completion after the loop is executed, overall synchronization is performed between the threads used for parallel execution.
International Publication Pamphlet No. WO 2007/096935 and Japanese Laid-open Patent Publication No. 2009-104422 are disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores a conversion program causing a computer to execute a process including: generating, based on a dependency relationship between statements in a program, a directed graph in which the statement in the program is a node and the dependency relationship is an edge; detecting, based on the dependency relationship represented by the edge in the generated directed graph, a node of which a part of a loop process has a dependency relationship with another preceding or following node, from the directed graph; updating the directed graph by dividing the detected node into a first node that has the part of the loop process and a second node that has a loop process other than the part of the loop process, fusing the divided first node and the another node, and assigning dependency information based on a data access pattern to a node after fusing; and converting the program, based on the directed graph after update.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
For example, there is a technique of obtaining a reversibly degenerate dependent element group by using program analysis information including a plurality of dependent elements representing a dependency relationship between a statement and control of a program, and generating a program dependency graph in which the dependent elements are degenerated by degenerating the dependent element group. There is another technique in which, in accordance with a generation policy of a parallel code input by a user, the processing of the code is divided, and a parallelization method is determined while predicting the execution cycles from the computation amount, the processing contents, the cache use of reused data, and the main memory access data amount.
Meanwhile, with the related art, the parallelization efficiency of a program decreases in some cases. For example, when the cost of the overall synchronization increases due to an increase in the number of cores of the shared-memory type processor or due to a variation in computation, the parallelization efficiency decreases and the program performance deteriorates.
In one aspect, an object of the present disclosure is to improve parallelization efficiency of a program.
Hereinafter, embodiments of a conversion program and a conversion method according to the disclosure are described in detail with reference to the drawings.
The data parallel description is a description for performing a computation by data parallelism. In the field of HPC, parallel programming for a shared-memory type processor often uses a data parallel description by OpenMP. OpenMP is an application programming interface (API) that enables parallel programming on a shared-memory type machine.
In OpenMP, a description is made by using an instruction statement to a compiler called a pragma directive (#pragma). For example, by designating the instruction statement for a parallelizable loop, the loop may be divided and allocated to each thread to be executed in parallel. In order to ensure computation completion after the loop is executed, overall synchronization is performed between the threads used for parallel execution. Meanwhile, in a case where there is no dependency relationship between a plurality of loops, the synchronization between the threads may be omitted.
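For reference, a minimal sketch of such a data parallel description is given below (the arrays, their size, and the function name are hypothetical); the implicit barrier at the end of the for construct performs the overall synchronization described above:

```c
#define N 8
double A[N], B[N];

/* Data parallel description: the iterations 0 .. N-1 are divided
   among the threads, and the implicit barrier at the end of the
   "for" construct performs the overall synchronization. */
void add_arrays(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = A[i] + B[i];
}
```

Compiled with OpenMP support (for example, -fopenmp), the iterations are executed by multiple threads; without it, the pragma is ignored and the loop runs serially with the same result.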
On the other hand, the number of cores of the shared-memory type processor is increasing year by year, and a cost of the overall synchronization tends to be increased. The overall synchronization between the threads will be described with reference to
In this case, overall synchronization is performed between the threads to ensure computation completion after the execution of the loop. In the example in
Therefore, in order to increase the speed of a program, for example, it is desirable to reduce the overall synchronization as much as possible and, with more fine-grained synchronization, to have idle threads (cores) start the computations one after another. Meanwhile, since the user is requested to determine whether or not there is a dependency relationship between the loops and to perform programming that eliminates the dependency relationship, there is a problem in that the implementation cost increases.
The dependent task parallel description is a description for speeding up a program by moving from overall synchronization to inter-task synchronization, by making each computation a task and explicitly describing the read/write of the data used in the task. In dependent task parallelism by OpenMP, the tasks are executed in parallel based on the data dependency descriptions (in, out, and inout) between the tasks.
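A minimal sketch of a dependent task parallel description is given below (the variables and the function name are hypothetical); the runtime derives the execution order of the two tasks from the in/out descriptions, without any overall synchronization between them:

```c
int x = 0, y = 0;

/* Dependent task parallel description: the depend clauses describe
   the read (in) and write (out) of the data used in each task, and
   the runtime orders the tasks from these dependences. */
void run_dependent_tasks(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)
        x = 42;                 /* task 1: writes x */

        #pragma omp task depend(in: x) depend(out: y)
        y = x + 1;              /* task 2: reads x, writes y, so it
                                   runs only after task 1 completes */
    }
}
```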
In data parallel, the data is divided and mapped to the threads. By contrast, in task parallel, a task is generated, and it is determined by a runtime of a compiler whether a dependency is released from the task that is completely executed, and the task is executed, so that the procedure is complicated and many. Therefore, an overhead of the task parallel is larger as compared with an overhead of the data parallel.
As described above, in the data parallel description, a cost of the overall synchronization is high. It is difficult for the user to grasp the dependency relationship of the entire program and perform programming to reduce the overall synchronization. The task parallel has the larger overhead as compared with the data parallel.
Accordingly, in Embodiment 1, a conversion method will be described in which a program implemented by the data parallel description is automatically converted to the dependent task parallel description so as to reduce the number of generated tasks and increase the parallelization efficiency while setting the tasks with an appropriate granularity and obtaining parallelism. Hereinafter, process examples ((1) to (4) below) of the conversion apparatus 101 will be described.
(1) Based on a dependency relationship between statements in a program, the conversion apparatus 101 generates a directed graph in which the statement in the program serves as a node and the dependency relationship between the statements serves as an edge. The program is a program to be converted, for example, a program of a data parallel description.
The statement is each statement such as a procedure, a command, or a declaration, which is a configuration unit of the program, and includes, for example, an equation, a function call, and the like. For example, the equation is a combination of a value, a variable, an operator, a function, and the like. The dependency relationship between the statements is, for example, a relationship based on a data dependency such as a flow dependency, an inverse flow dependency, and an output dependency.
The flow dependency is a dependency in which written data is read out after the writing (Read After Write). The inverse flow dependency is opposite to the flow dependency, in that writing is performed after reading (Write After Read). The output dependency is a dependency in which a separate value is written after writing (Write After Write). When there is a dependency relationship between statements based on any of the flow dependency, the inverse flow dependency, and the output dependency, the statements may not be executed in parallel.
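The three data dependencies can be illustrated with statement pairs such as the following (the variable names are hypothetical):

```c
int a = 0, b = 0;

/* Flow dependency (Read After Write): S2 reads the value S1 wrote. */
void flow_dep(void)   { a = 1; b = a; }   /* S1; S2 */

/* Inverse flow dependency (Write After Read): S2 overwrites a value
   that S1 still has to read, so S2 may not run before S1. */
void anti_dep(void)   { b = a; a = 2; }   /* S1; S2 */

/* Output dependency (Write After Write): both statements write a,
   so their order determines the final value. */
void output_dep(void) { a = 3; a = 4; }   /* S1; S2 */
```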
The directed graph is a graph including nodes and edges coupling the nodes, and each edge has a direction. A node that is not coupled to a separate node by the edge may be included in the directed graph. The node has, for example, data access information of the statement. For example, the data access information indicates an access range or an access pattern of the loop process. For example, the access pattern is represented by a variable or the like of an access (read/write) destination.
For example, the conversion apparatus 101 analyzes a dependency relationship between the statements in a program 110 by dependency analysis of the program 110 with a compiler. The program 110 is a program of a data parallel description. Based on a result of the dependency analysis of the program 110, the conversion apparatus 101 generates a directed graph 120.
The directed graph 120 includes nodes (for example, nodes 120-1 to 120-4) representing statements in the program 110 and edges (for example, edges 120-11 to 120-13) representing a dependency relationship between the statements. The dependency relationship is a relationship based on data dependency (flow dependency, inverse flow dependency, and output dependency).
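For illustration only, a directed graph such as the directed graph 120 could be represented by records such as the following (all type and field names are assumptions, not the embodiment's actual data structures):

```c
typedef enum { DEP_FLOW, DEP_INVERSE_FLOW, DEP_OUTPUT } dep_kind_t;

/* A node holds the statement and its data access information:
   the access range of the loop process and the access pattern. */
typedef struct {
    const char *stmt;     /* statement text, e.g. "A[i] = A[i] + B[i]" */
    int         lo, hi;   /* access range of the loop process [lo, hi) */
    const char *reads;    /* variables of the reading destination      */
    const char *writes;   /* variables of the writing destination      */
} dg_node_t;

/* A directed edge couples two nodes and records the data dependency. */
typedef struct {
    int        src, dst;  /* indices of the coupled nodes */
    dep_kind_t kind;
} dg_edge_t;
```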
(2) Based on the dependency relationship represented by the edge in the generated directed graph, the conversion apparatus 101 detects, from the directed graph, a node of which a part of a loop process has a dependency relationship with another preceding or following node. For example, it is assumed that a statement 1 represented by the node 120-1 has a loop process of reading and writing data from and to A[i] in a range from “i=0” to “i=N−1”.
It is assumed that a statement 2 represented by the node 120-2 has only a read for A[0]. In this case, the statements 1 and 2 depend only on A[0]. The statement 1 and statement 2 do not depend on each other in a range from “i=1” to “i=N−1”.
A case is assumed in which the node 120-1 is detected from the directed graph 120. The node 120-1 is a node of which a part of the loop process (i=0) has a dependency relationship with the other preceding node 120-2.
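The situation of the statements 1 and 2 can be sketched as follows (func1, the arrays, and their size are hypothetical); only the iteration i = 0 of the statement 1 is depended on by the statement 2:

```c
#define N 8
double A[N], B[N];
double result;

/* Hypothetical stand-in for the reader of A[0]. */
double func1(double v) { return v * 2.0; }

void two_statements(void)
{
    /* Statement 1: reads and writes A[i] for i = 0 .. N-1. */
    for (int i = 0; i < N; i++)
        A[i] = A[i] + B[i];

    /* Statement 2: reads only A[0], so it depends only on the
       iteration i = 0 of statement 1; the iterations 1 .. N-1
       have no dependency relationship with statement 2. */
    result = func1(A[0]);
}
```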
(3) The conversion apparatus 101 divides the detected node into a first node having a part of the loop process and a second node having the loop process other than the part of the loop process, and fuses the divided first node and the other node. The part of the loop process is a loop process having a dependency relationship with another preceding or following node, in the loop process of the detected node. The fusing of the nodes means that two nodes are collectively handled as one task.
By assigning dependency information based on a data access pattern to the node after fusing, the conversion apparatus 101 updates the directed graph. The dependency information is information indicating what kind of access (read or write) is made to which data in the process (task) of each node. For example, the dependency information includes information such as "depend (out: A[0])" assigned after #pragma omp. With the dependency information, it is possible to determine what kind of dependency exists between the task and a separate task.
For example, the conversion apparatus 101 divides the node 120-1 into a first node 120-1a and a second node 120-1b. The first node 120-1a is a node having a part of the loop process having a dependency relationship with the other preceding node 120-2, in the loop process of the node 120-1. The second node 120-1b is a node having a loop process other than the part of the loop process having the dependency relationship with the other preceding node 120-2, in the loop process of the node 120-1.
After that, the conversion apparatus 101 fuses the divided first node 120-1a and the other node 120-2. A node 130 after fusing is obtained by fusing the first node 120-1a and the other node 120-2 as one task. The conversion apparatus 101 updates the directed graph 120 by assigning the dependency information based on the data access pattern to the node 130 after fusing.
In detail, for example, the conversion apparatus 101 assigns dependency information 140 to the node 130 after fusing. The dependency information 140 indicates what kind of access (read or write) is made to which data when the node 130 after fusing is executed as one task.
(4) The conversion apparatus 101 converts the program based on the directed graph after update. For example, the conversion apparatus 101 converts the program 110 of the data parallel description into a program 150 of the dependent task parallel description, based on the directed graph 120 after update.
As an existing function of the compiler, there is a function of performing reversible conversion that restores an original program based on information obtained by creating a directed graph of the program. The conversion into the program 150 of the dependent task parallel description based on the directed graph 120 after update may be performed by using such an existing function of the compiler, for example.
As described above, with the conversion apparatus 101 according to Embodiment 1, in a case where only a part of the loop process of a node in the directed graph has a dependency relationship with the other preceding or following node, it is possible to divide only that part into a separate node and fuse the separate node with the other node. Therefore, in task parallelization, it is possible to reduce the number of generated tasks while acquiring parallelism, and to improve the parallelization efficiency. For example, the conversion apparatus 101 may improve the performance of the program by extracting parallelism through division and fusion of the nodes based on the loop length or the data access pattern of the process to be tasked.
Next, a conversion method according to Embodiment 2 will be described. A case where the conversion apparatus 101 illustrated in
First, an example of a hardware configuration of the information processing apparatus 200 according to Embodiment 2 is described with reference to
The CPU 201 controls an entirety of the information processing apparatus 200. The CPU 201 may include a plurality of cores. The memory 202 includes, for example, a read-only memory (ROM), a random-access memory (RAM), a flash ROM, and the like. For example, the flash ROM stores a program of an operating system (OS), the ROM stores an application program, and the RAM is used as a work area of the CPU 201. The programs stored in the memory 202 cause the CPU 201 to execute a coded process by being loaded into the CPU 201.
The disk drive 203 controls reading and writing of data from and to the disk 204 according to the control of the CPU 201. The disk 204 stores written data under the control of the disk drive 203. As the disk 204, for example, there are a magnetic disk, an optical disc, and the like.
The communication I/F 205 is coupled to a network 210 via a communication line and coupled to an external computer via the network 210. The communication I/F 205 functions as an interface between the network 210 and an inside of the apparatus and controls an input and an output of data from and to the external computer. For example, a modem, a LAN adapter, or the like may be adopted as the communication I/F 205.
The display 206 is a display device that displays data such as a cursor, icons, and a toolbox, and also displays documents, images, functional information, and the like. As the display 206, for example, a liquid crystal display, an organic electroluminescence (EL) display, or the like may be employed.
The input device 207 has keys for inputting characters, numbers, various instructions, and the like and is used for inputting data. The input device 207 may be a touch panel input pad, a numeric keypad, or the like or may be a keyboard, a mouse, or the like.
The portable-type recording medium I/F 208 controls reading and writing of data from and to the portable-type recording medium 209 in accordance with the control of the CPU 201. The portable-type recording medium 209 stores data written under the control of the portable-type recording medium I/F 208. Examples of the portable-type recording medium 209 include a compact disc (CD)-ROM, a Digital Versatile Disk (DVD), a Universal Serial Bus (USB) memory, and the like.
The information processing apparatus 200 may not include, for example, the disk drive 203, the disk 204, the portable-type recording medium I/F 208, and the portable-type recording medium 209, among the components described above. The conversion apparatus 101 illustrated in
(Specific Example of Program to be Converted)
A specific example of a program to be converted will be described with reference to
An instruction statement of OpenMP is described by a pragma (#pragma) and has a form such as "#pragma omp". For example, "#pragma omp parallel" designates a section (parallel region) to be executed in parallel. "#pragma omp for" parallelizes a for statement. "#pragma omp single" designates a block to be executed by only one thread.
stmt0, stmt1, stmt2, and stmt3 are identifiers for identifying statements. stmt0 corresponds to “A[i]=A[i]+B[i]”. stmt1 corresponds to “func1(A[0])”. stmt2 corresponds to “A[i]=A[i]+C[i]”. stmt3 corresponds to “func2( )”.
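From these identifiers and the instruction statements above, the program 300 may be reconstructed approximately as follows (N, the array contents, and the bodies of func1 and func2 are assumptions for illustration):

```c
#define N 8
double A[N], B[N], C[N];

double captured;                          /* hypothetical observer */
void func1(double v) { captured = v; }
void func2(void)     { /* no data access, as in stmt3 */ }

void program300(void)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            A[i] = A[i] + B[i];           /* stmt0 */

        #pragma omp single
        func1(A[0]);                      /* stmt1 */

        #pragma omp for
        for (int i = 0; i < N; i++)
            A[i] = A[i] + C[i];           /* stmt2 */

        #pragma omp single
        func2();                          /* stmt3 */
    }
}
```

Each "#pragma omp for" ends with an implicit barrier, so the program performs overall synchronization between the statements even where only A[0] carries a dependency.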
(Functional Configuration Example of Information Processing Apparatus 200)
Next, a functional configuration example of the information processing apparatus 200 according to Embodiment 2 will be described.
The reception unit 401 receives a program to be converted. The program to be converted is a program of a data parallel description, for example, a program for HPC. Hereinafter, the program to be converted is referred to as a “program P”, in some cases. For example, the program P is the program 300 as illustrated in
For example, the reception unit 401 receives the program 300 by an operation input of the user who uses the input device 207 illustrated in
Based on a dependency relationship between statements in the program P, the generation unit 402 generates a directed graph G in which the statement in the program P is a node and the dependency relationship between the statements is an edge. The statement is a configuration unit of the program, and includes, for example, an equation, a function call, and the like. The dependency relationship between the statements is, for example, a relationship based on a data dependency of any of a flow dependency, an inverse flow dependency, and an output dependency. The node has, for example, data access information of the statement.
Hereinafter, the directed graph in which the statement in the program P is the node and the dependency relationship between the statements is the edge is referred to as a “directed graph G”, in some cases.
For example, the generation unit 402 analyzes the dependency relationship between the statements in the program P by dependency analysis of the program P by a compiler. The compiler is a translation program that converts a program described in a high-level language into a machine language that may be directly interpreted and executed by a computer. The dependency relationship is represented by, for example, which range of which variable between the statements has a dependency. Based on a result of the dependency analysis of the program P, the generation unit 402 generates the directed graph G.
A specific example of the directed graph G will be described below with reference to
Based on a dependency relationship represented by an edge in the generated directed graph G, the detection unit 403 detects, from the directed graph G, the node Ni of which a part of a loop process has a dependency relationship with the another preceding or following node Nj. The loop process is a process that is repeatedly executed.
The node Ni as the detection target is a node having at least the loop process. The another node Nj preceding the node Ni is the node Nj on the root (source) side of the edge, which is coupled to the node Ni by the edge. The another node Nj following the node Ni is a node on the tip (destination) side of the edge, which is coupled to the node Ni by the edge.
For example, the detection unit 403 determines whether or not a part of the loop process of the node Ni has a dependency relationship with the another node Nj, based on a dependency relationship between the nodes Ni and Nj, which represent which range of which variable is dependent. In a case where the part of the loop process has the dependency relationship with the another node Nj, the detection unit 403 detects the node Ni.
An example of detecting a node from the directed graph G will be described below with reference to
The update unit 404 divides the detected node Ni into a first node and a second node, fuses the divided first node and the another node Nj, and assigns dependency information based on a data access pattern to the node after fusing to update the directed graph G.
The first node is a node having only a part of the loop process having a dependency relationship with the another node Nj, in the loop process of the node Ni. The second node is a node having only the loop process other than the part of the loop process having the dependency relationship with the another node Nj, in the loop process of the node Ni. The fusing of the nodes means that two nodes are collectively handled as one task, and corresponds to a setting of a granularity of the task.
In a case where there is a dependency relationship between the node after fusing and the other node, the node after fusing and the other node are coupled by an edge. In a case where there is a dependency relationship between the second node and the other node, the second node and the other node are coupled by an edge.
The dependency information based on the data access pattern is information indicating what kind of access (read or write) is made to which data in the process (task) of each node. The dependency information assigned to the node after fusing is specified from, for example, data access information of the node after fusing.
For example, the dependency information includes information such as “depend (out: A[0])” assigned after #pragma omp. out: A[0] indicates writing to A[0]. The dependency information is information for making it possible to determine what kind of dependency exists between a task and a separate task at a runtime of the compiler.
An example of dividing the node Ni will be described below with reference to
The update unit 404 determines whether or not a node preceding the divided second node has a loop process. At this time, in a case where there are a plurality of nodes preceding the second node, the update unit 404 determines whether or not any node preceding the second node has the loop process.
In a case where the node preceding the second node does not have the loop process, the update unit 404 determines a task granularity (division granularity) in a case where the loop process of the second node is divided into a plurality of tasks, based on hardware information. The hardware information is information on hardware that executes the program P after conversion, and includes, for example, a size of a cache line of a core to which a task is allocated. The task granularity is represented by, for example, a loop length.
For example, the update unit 404 determines the task granularity such that the loop length is fitted in the size of the cache line. For the second node, the update unit 404 sets the determined task granularity, and assigns dependency information based on the data access pattern to update the directed graph G. For example, the dependency information assigned to the second node is specified from the data access information and the task granularity of the second node.
Therefore, the update unit 404 divides the loop process of the second node and enables the plurality of tasks to execute the loop process in parallel. At this time, in order to reduce the number of generated tasks, the update unit 404 sets the task granularity (division granularity) in consideration of the size of the cache line, which corresponds to the amount of data that may be processed at one time. Meanwhile, in a case where the number of iterations of the loop process of the second node is one, the update unit 404 does not divide the loop process of the second node (execution as one task).
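One possible rule for determining the task granularity from the hardware information is sketched below; the specific formula is an assumption, since the embodiment only states that the loop length is fitted in the size of the cache line:

```c
#include <stddef.h>

/* Choose a loop length (task granularity) so that the data of one
   task fills one cache line; at least one iteration is always
   assigned so that loops over large elements remain executable. */
size_t task_granularity(size_t cache_line_bytes, size_t elem_bytes)
{
    size_t len = cache_line_bytes / elem_bytes;
    return len > 0 ? len : 1;
}
```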
An example of setting the task granularity for the second node and an example of assigning the dependency information to the second node will be described below with reference to
By contrast, in a case where the node preceding the second node has the loop process, the update unit 404 determines a task granularity for dividing the loop process of the second node into a plurality of tasks such that the data access range is aligned with the preceding node. The data access range indicates which range of which data each task obtained by dividing the loop process accesses. For example, in a case where the node preceding the second node has the loop process and the entire loop process has a dependency relationship with the preceding node, the update unit 404 determines the loop length such that the data access range is aligned with the preceding node.
For the second node, the update unit 404 sets the determined task granularity, and assigns dependency information based on the data access pattern to update the directed graph G. Therefore, the update unit 404 divides the loop process of the second node and enables the plurality of tasks to execute the loop process in parallel. At this time, since performance may be decreased when the granularity is set in units of the entire loop process, the update unit 404 sets the task granularity such that the data access range is aligned with the preceding node.
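A sketch of dividing the loop process of the second node with a granularity aligned to the preceding node (N, CHUNK, and the arrays are assumptions; array sections in depend clauses follow OpenMP 4.5). Because both nodes use the same CHUNK, each task waits for exactly one predecessor task instead of the whole preceding loop:

```c
#define N     16
#define CHUNK 4             /* same granularity as the preceding node */
double A[N], C[N];

void second_node(void)
{
    #pragma omp parallel
    #pragma omp single
    for (int ii = 0; ii < N; ii += CHUNK) {
        /* The access range A[ii:CHUNK] is aligned with the tasks of
           the preceding node, so each task depends on exactly one
           predecessor task that produced the same section of A. */
        #pragma omp task depend(inout: A[ii:CHUNK])
        for (int i = ii; i < ii + CHUNK; i++)
            A[i] = A[i] + C[i];
    }
}
```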
An example of determining the task granularity with which the data access range is aligned with the preceding node will be described below with reference to
For example, in a case where the directed graph G is updated, the detection unit 403 detects, from the directed graph G after update, the node Ni of which a part of the loop process has a dependency relationship with the another preceding or following node Nj. For example, the setting process of the task granularity is performed on all the nodes having the loop process in the directed graph G (the directed graph G after update). For example, the process of assigning the dependency information is performed on each node in the directed graph G (the directed graph G after update).
Based on the directed graph G after update, the conversion unit 405 converts the program P. For example, the conversion unit 405 converts the program P in the data parallel description into the program P in the dependent task parallel description, based on the directed graph G after update.
In detail, for example, the conversion unit 405 uses an existing function of the compiler to generate, from the directed graph G after update, the program P of the dependent task parallel description in which the computations are made into tasks. In the program P of the dependent task parallel description, the read/write of the data used in each task is explicitly described, based on the dependency information assigned to each node in the directed graph G after update.
A specific example of the program P after conversion will be described below with reference to
The output unit 406 outputs the program P after conversion. Examples of an output method by the output unit 406 include storing in a storage device such as the memory 202 or the disk 204, transmitting to another computer via the communication I/F 205, and the like. For example, the output unit 406 passes the program P after conversion to the runtime of the compiler, or transmits the program P after conversion to another computer (for example, an execution apparatus).
The functional units (the reception unit 401 to the output unit 406) of the information processing apparatus 200 described above are realized by, for example, a compiler of the information processing apparatus 200.
(Specific Example of Directed Graph G)
A specific example of the directed graph G will be described with reference to
The directed graph 500 includes nodes N0 to N3 and edges e1 to e3. The node N0 represents stmt0 (statement) in the program 300. The node N1 represents stmt1 in the program 300. The node N2 represents stmt2 in the program 300. The node N3 represents stmt3 in the program 300.
The edge e1 represents a dependency relationship between stmt0 and stmt1. For example, the edge e1 indicates that there is a dependency (inverse flow dependency) of a variable A[0] between stmt0 and stmt1. The edge e2 represents a dependency relationship between stmt0 and stmt2. For example, the edge e2 indicates that there is a dependency (output dependency) of the variable A[0: N] between stmt0 and stmt2. N in [0: N] indicates the number of elements. [0: N] indicates a range of 0, 1, . . . , and N−1. The edge e3 represents a dependency relationship between stmt1 and stmt2. For example, the edge e3 indicates that there is a dependency (flow dependency) of the variable A[0] between stmt1 and stmt2. A separate node is not coupled to the node N3.
Each of the nodes N0 to N3 has, for example, data access information 501 to 504 of each of stmt0 to stmt3, as illustrated in the diagram 5B. The data access information 501 to 504 indicates an access range of a loop process of each of stmt0 to stmt3, a variable of an access (read/write) destination, and the like.
The data access information 501 is information included in the node N0, and indicates an access range “loop: 0<=i<N” of a loop process of stmt0, variables “A[i], B[i]” of a reading destination, and a variable “A[i]” of a writing destination. The data access information 502 is information included in the node N1, and indicates a variable “A[0]” of a reading destination of stmt1.
The data access information 503 is information included in the node N2, and indicates an access range “loop: 0<=i<N” of a loop process of stmt2, variables “A[i], C[i]” of a reading destination, and a variable “A[i]” of a writing destination. The data access information 504 is information included in the node N3, and indicates that there is no loop process in stmt3 and there is no variable of an access destination.
(Update Example of Directed Graph G)
An example of updating the directed graph G will be described with reference to
In the example of the directed graph 500 illustrated in
For example, in stmt0, there are read and write for the variable A from 0 to N−1 of i, and there is read for a variable B. stmt1 has read for [0] of the variable A. Therefore, there is a dependency between stmt0 and stmt1 for [0] of the variable A. In this case, the detection unit 403 detects the node N0 from the directed graph 500. In the node N0, a part (A[0]) of the loop process has a dependency relationship with the other following node N1.
Hereinafter, as a combination of the node Ni and the another node Nj, the node N0 (data access information 501) and the node N1 (data access information 502) will be described as an example.
As illustrated in
The node N0b is a node having the part of the loop process (A[0]) having the dependency relationship with the other node N1, in the loop process of the node N0. The node N0b is coupled to the other node N1 by the edge e1. The nodes N0a, N0b, and N1 have data access information 701, 702, and 502, respectively.
For example, the data access information 701 is information included in the node N0a, and indicates an access range “loop: 1<=i<N” of a loop process of stmt0a, variables “A[i], B[i]” of a reading destination, and a variable “A[i]” of a writing destination. stmt0a is a statement represented by the node N0a.
The data access information 702 is information included in the node N0b, and indicates variables “A[0], B[0]” of a reading destination and the variable “A[0]” of a writing destination of stmt0b. stmt0b is a statement represented by the node N0b.
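The division of the node N0 into the node N0b (the dependent part, 0<=i<1) and the node N0a (the rest, 1<=i<N) can be sketched as a split of the access range. The helper below assumes, as in this example, that the dependent part is a prefix of the loop range; the names are illustrative, and N is fixed to 8 for the sketch.

```python
def divide_node(loop_range, dep_indices):
    """Split a loop node into the part with the dependency and the rest.

    Sketch under the assumption that the dependent part is a prefix of the
    loop range, as with A[0] in the example (node N0 -> N0b and N0a).
    """
    lo, hi = loop_range
    dep_hi = max(dep_indices) + 1          # end of the dependent prefix
    part = (lo, dep_hi)                    # first node  (N0b): 0 <= i < 1
    rest = (dep_hi, hi)                    # second node (N0a): 1 <= i < N
    return part, rest

# stmt0's range [0, 8) is divided at A[0].
n0b_range, n0a_range = divide_node((0, 8), [0])
print(n0b_range, n0a_range)  # (0, 1) (1, 8)
```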
As illustrated in
The update unit 404 updates the directed graph 500 by assigning dependency information 902 as illustrated in
For example, the dependency information 902 includes depend (out: A[0]) and depend (in: A[0], B[0]). depend (out: A[0]) indicates that there is writing for A[0]. depend (in: A[0], B[0]) indicates that there is reading for A[0] and B[0]. In the example of the dependency information 902 illustrated in
The node N0a divided from the node N0 has no preceding node, and the following node does not have a loop process. In this case, the update unit 404 determines, based on hardware information, a task granularity when the loop process of the node N0a is divided into a plurality of tasks. For example, the update unit 404 determines the task granularity such that a loop length fits in the size of a cache line.
It is assumed that the task granularity when the loop process of the node N0a is divided into the plurality of tasks is determined to be “cache”. In this case, the update unit 404 sets the determined task granularity “cache” to the node N0a, and assigns the dependency information 901 as illustrated in
The dependency information 901 is information based on a data access pattern in the node N0a. The data access pattern of the node N0a is specified from the data access information 701. For example, the dependency information 901 includes depend (out: A[ii: cache]) and depend (in: A[ii: cache], B[ii: cache]). ii is an integer from 1 to N−1.
cache is a task granularity determined in accordance with the size of the cache line. Based on this task granularity, the loop process included in the node N0a is divided into the plurality of tasks. For example, in the example of the dependency information 901, a first task is executed for a size of one cache line from 1 of ii, and a second task is executed for the size of one cache line from a position shifted by the size of one cache line from 1 of ii.
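The chunking described above can be sketched numerically. Assuming, purely for illustration, a 64-byte cache line and 8-byte elements (these sizes are not stated in the embodiment), the task granularity and the resulting tasks for the loop 1<=i<N of the node N0a (with N taken as 20) are:

```python
def cache_granularity(cache_line_bytes, element_bytes):
    """Elements per task so that one task's access fits in one cache line."""
    return cache_line_bytes // element_bytes

def chunk_loop(lo, hi, granularity):
    """Divide the loop range [lo, hi) into tasks of the given granularity."""
    return [(ii, min(ii + granularity, hi)) for ii in range(lo, hi, granularity)]

# 64-byte cache line / 8-byte (double) elements -> 8 elements per task.
g = cache_granularity(64, 8)
# The first task starts at ii = 1 and covers one cache line, the second
# starts one cache line further, and the last task takes the remainder.
print(g, chunk_loop(1, 20, g))  # 8 [(1, 9), (9, 17), (17, 20)]
```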
depend (out: A[ii: cache]) indicates that there is writing to A[ii: cache]. depend (in: A[ii: cache], B[ii: cache]) indicates that there is reading for A[ii: cache], B[ii: cache]. In the example of the dependency information 901 illustrated in
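Rendering the dependency information 901 and 902 as clause strings can be sketched as follows. The helper is hypothetical, while the depend (out: …)/depend (in: …) notation follows the embodiment.

```python
def depend_clauses(reads, writes):
    """Render dependency information as depend clause strings.

    The textual form follows the dependency information 901 and 902 in the
    embodiment; the helper itself is an illustrative sketch.
    """
    clauses = []
    if writes:
        clauses.append("depend(out: %s)" % ", ".join(writes))
    if reads:
        clauses.append("depend(in: %s)" % ", ".join(reads))
    return clauses

# Dependency information 902 for the node after fusing (N0b + N1).
print(depend_clauses(["A[0]", "B[0]"], ["A[0]"]))
# Dependency information 901 for the node N0a, with the "cache" granularity.
print(depend_clauses(["A[ii: cache]", "B[ii: cache]"], ["A[ii: cache]"]))
```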
Therefore, it is possible to obtain the directed graph 500 in which the information (for example, the dependency information 901 and 902) desirable for conversion into a dependent task parallel description is assigned to each node (for example, the node N0a and the node after fusing (N0b+N1)).
(Example of Determining Task Granularity with Data Access Range Aligned with Preceding Node)
An example of determining a task granularity with which a data access range is aligned with a preceding node will be described with reference to
A dependency relationship of a variable A[0: 6] exists between the node representing stmt0 and the node representing stmt1. For example, the node N1 preceding the node N2 has a loop process, and all of the loop process has a dependency relationship between the node N1 and the node N2. It is assumed that a division granularity at which the loop process of stmt0 represented by the node N1 is divided into three tasks is determined based on hardware information.
Data access information 1001 is information included in the node N1, and indicates an access range “loop: 0<=i<2” of a loop process of stmt0a and a variable “A[i]” of a writing destination. stmt0a indicates a first task in a case where stmt0 is divided into three.
Data access information 1002 is information included in the node N1, and indicates an access range “loop: 2<=i<4” of a loop process of stmt0b and the variable “A[i]” of a writing destination. stmt0b indicates a second task in the case where stmt0 is divided into three.
Data access information 1003 is information included in the node N1, and indicates an access range “loop: 4<=i<6” of a loop process of stmt0c and the variable “A[i]” of a writing destination. stmt0c indicates a third task in the case where stmt0 is divided into three.
As illustrated on a left side in
As illustrated on a right side in
In this case, stmt1a has a dependency relationship with only stmt0a. stmt1b has a dependency relationship with only stmt0b. stmt1c has a dependency relationship with only stmt0c. As described above, in the case where stmt1 is divided into three tasks, the dependency relationships are reduced, as compared with the case where stmt1 is divided into two tasks.
For example, in the case where stmt1 is divided into two tasks, the dependency relationships are increased as compared with the case where stmt1 is divided into three tasks, and thus there is a possibility that performance is decreased. Accordingly, the update unit 404 sets the task granularity when dividing the loop process of the node N2 into a plurality of tasks to the same task granularity as that of the preceding node N1.
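The effect of aligning the task granularity can be checked by counting overlapping access ranges. With the producer stmt0 divided into three tasks of two iterations over A[0: 6], an aligned three-way division of stmt1 yields exactly one dependency per task, while a misaligned two-way division yields extra cross dependencies. The helper below is an illustrative sketch.

```python
def count_dependencies(prod_chunks, cons_chunks):
    """Count producer->consumer task dependencies from overlapping ranges."""
    deps = 0
    for plo, phi in prod_chunks:
        for clo, chi in cons_chunks:
            if max(plo, clo) < min(phi, chi):  # half-open ranges overlap
                deps += 1
    return deps

# stmt0 writes A[0: 6] as three tasks of two iterations each.
producer = [(0, 2), (2, 4), (4, 6)]
# Aligned: stmt1 also divided into three tasks -> one dependency per task.
print(count_dependencies(producer, [(0, 2), (2, 4), (4, 6)]))  # 3
# Misaligned: stmt1 divided into two tasks of three -> cross dependencies.
print(count_dependencies(producer, [(0, 3), (3, 6)]))  # 4
```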
Therefore, the update unit 404 may increase processing speed by aligning the data access ranges between the loop processes having the dependency relationships.
A specific example of the program P after conversion will be described with reference to
(Conversion Process Procedure of Information Processing Apparatus 200)
A conversion process procedure of the information processing apparatus 200 according to Embodiment 2 will be described.
In a case where the program P to be converted is received (Yes in step S1301), the information processing apparatus 200 generates the directed graph G, based on a dependency relationship between statements in the program P (step S1302). The directed graph G is information in which the statement in the program P is a node and the dependency relationship between the statements is an edge.
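The edge generation of step S1302 can be sketched from read/write sets: a flow dependency (read after write), an inverse flow dependency (write after read), or an output dependency (write after write) between two statements produces an edge. The triple-based representation below is an illustrative assumption, not the embodiment's actual analysis.

```python
def build_edges(stmts):
    """Build directed edges from data dependencies between statements.

    An edge (s, t) is added when a later statement t has a flow, inverse
    flow, or output dependency on an earlier statement s. stmts is a list
    of (name, reads, writes) triples; an illustrative sketch.
    """
    edges = []
    for i, (s, s_reads, s_writes) in enumerate(stmts):
        for t, t_reads, t_writes in stmts[i + 1:]:
            flow = set(s_writes) & set(t_reads)       # read after write
            anti = set(s_reads) & set(t_writes)       # write after read
            output = set(s_writes) & set(t_writes)    # write after write
            if flow or anti or output:
                edges.append((s, t))
    return edges

# A[0] written by stmt0 and read by stmt1: flow dependency stmt0 -> stmt1.
print(build_edges([("stmt0", ["A[0]", "B[0]"], ["A[0]"]),
                   ("stmt1", ["A[0]"], [])]))  # [('stmt0', 'stmt1')]
```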
After that, the information processing apparatus 200 selects an unselected node Ni from the directed graph G (step S1303). The directed graph G as a selection source is the directed graph G generated in step S1302 or the directed graph G after update in which dependency information is assigned to each node in step S1306.
At this time, for example, the information processing apparatus 200 first selects a root node of the directed graph G, and then sequentially selects a following node. For example, in a case where there are a plurality of following nodes, the information processing apparatus 200 selects the closest node in the program among the plurality of following nodes. In a case where there is no following node, the information processing apparatus 200 selects, for example, the uppermost unselected node.
After that, the information processing apparatus 200 determines whether or not the selected node Ni has a loop process (step S1304). In a case where the node Ni does not have the loop process (No in step S1304), the information processing apparatus 200 proceeds to step S1306. By contrast, in a case where the node Ni has the loop process (Yes in step S1304), the information processing apparatus 200 executes a division and fusion process (step S1305).
The division and fusion process is a process of dividing the node Ni and fusing the divided node Ni with the another node Nj. A specific processing procedure of the division and fusion process will be described below with reference to
By assigning dependency information based on a data access pattern to each node, the information processing apparatus 200 updates the directed graph G (step S1306). A node to which the dependency information is to be assigned is, for example, the node Ni selected in step S1303 or a node after fusing fused in step S1403 illustrated in
After that, the information processing apparatus 200 determines whether or not there is an unselected node in the directed graph G (step S1307). In a case where there is an unselected node (Yes in step S1307), the information processing apparatus 200 returns to step S1303.
By contrast, in a case where there is no unselected node (No in step S1307), the information processing apparatus 200 converts the program P based on the directed graph G after update (step S1308). After that, the information processing apparatus 200 outputs the program P after conversion (step S1309), and ends a series of processes according to the present flowchart.
Therefore, the information processing apparatus 200 may convert the program P of a data parallel description into the program P of a dependent task parallel description.
A specific processing procedure of the division and fusion process in the step S1305 will be described with reference to
In a case where the part of the loop process does not have the dependency relationship with the another preceding or following node Nj (No in step S1401), the information processing apparatus 200 proceeds to step S1404. By contrast, in a case where the part of the loop process has the dependency relationship with the another preceding or following node Nj (Yes in step S1401), the information processing apparatus 200 divides the selected node Ni into a first node and a second node (step S1402).
The first node is a node having only the part of the loop process that has the dependency relationship with the another node Nj, in the loop process of the node Ni. The second node is a node having only the loop process other than that part, in the loop process of the node Ni.
The information processing apparatus 200 fuses the divided first node and the another node Nj (step S1403). After that, the information processing apparatus 200 determines whether or not the selected node Ni or a node preceding the divided second node has a loop process (step S1404).
In a case where the preceding node does not have a loop process (No in step S1404), the information processing apparatus 200 determines, based on the hardware information, a task granularity when the loop process included in the node Ni or the second node is divided into a plurality of tasks (step S1405), and returns to the step in which the division and fusion process is called.
By contrast, in a case where the preceding node has a loop process (Yes in step S1404), the information processing apparatus 200 determines the task granularity when the loop process of the node Ni or the second node is divided into the plurality of tasks such that a data access range is aligned with the preceding node (step S1406), and returns to the step in which the division and fusion process is called.
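Steps S1401 to S1406 can be condensed into a single sketch. The helper assumes the dependent part is a prefix of the loop range (as with A[0] in the earlier example) and represents the granularity decision by two inputs; the function name and arguments are illustrative, not the embodiment's interface.

```python
def divide_and_fuse(loop_range, dep_indices, pred_has_loop,
                    pred_granularity, hw_granularity):
    """Sketch of the division and fusion process (steps S1401 to S1406).

    Returns (first_node_range, second_node_range, task_granularity).
    first_node_range is the part fused with the other node Nj, or None
    when the whole loop (or nothing) carries the dependency.
    """
    lo, hi = loop_range
    if dep_indices and len(dep_indices) < hi - lo:   # S1401: partial dependency
        split = max(dep_indices) + 1                 # S1402: divide (prefix case)
        first, second = (lo, split), (split, hi)     # S1403: first node is fused
    else:
        first, second = None, (lo, hi)
    if pred_has_loop:                                # S1406: align the data
        granularity = pred_granularity               #   access range with the
    else:                                            #   preceding node
        granularity = hw_granularity                 # S1405: hardware information
    return first, second, granularity

# Node N0: loop 0 <= i < 8, only A[0] depends on the following node N1,
# no preceding loop; granularity taken from hardware information ("cache" = 8).
print(divide_and_fuse((0, 8), [0], False, None, 8))  # ((0, 1), (1, 8), 8)
```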
Therefore, in a case where only a part of the loop process of the node Ni has a dependency relationship with the another preceding or following node Nj, the information processing apparatus 200 may reduce the number of generated tasks by dividing only that part into a separate node and fusing the separate node with the another node Nj. The information processing apparatus 200 may determine an appropriate task granularity when the loop process is divided into a plurality of tasks, based on hardware information or a data access range of the preceding node.
As described above, with the information processing apparatus 200 according to Embodiment 2, it is possible to generate the directed graph G in which the statement in the program P is a node and a dependency relationship between the statements is an edge, based on the dependency relationship between the statements in the program P of a data parallel description. With the information processing apparatus 200, it is possible to detect, from the directed graph G, the node Ni of which a part of the loop process has a dependency relationship with the another preceding or following node Nj, based on the dependency relationship represented by the edge in the generated directed graph G. With the information processing apparatus 200, it is possible to update the directed graph G by dividing the detected node Ni into a first node having the part of the loop process and a second node having the loop process other than the part of the loop process, fusing the divided first node and the another node, and assigning dependency information based on a data access pattern to the node after fusing. With the information processing apparatus 200, it is possible to convert the program P of a data parallel description into the program P of a dependent task parallel description, based on the directed graph G after update.
Therefore, in a case where only the part of the loop process of the node Ni has the dependency relationship with the another preceding or following node Nj, the information processing apparatus 200 may divide the part into a separate node and fuse the separate node and the another node Nj. Accordingly, in task parallelization, it is possible to reduce the number of generated tasks while acquiring parallelism, and to improve parallelization efficiency.
With the information processing apparatus 200, in a case where a node preceding the second node does not have a loop process, it is possible to determine a task granularity when a loop process of the second node is divided into a plurality of tasks, based on the hardware information. With the information processing apparatus 200, it is possible to update the directed graph G by setting the determined task granularity and assigning dependency information based on a data access pattern to the second node.
Therefore, the information processing apparatus 200 may improve the parallelization efficiency by dividing the loop process (a plurality of processes) into tasks having an appropriate granularity, based on the hardware information. For example, the information processing apparatus 200 may determine a task granularity when the loop process of the second node is divided into a plurality of tasks, based on a size of a cache line included in the hardware information. In this case, the task granularity may be set in consideration of the size of the cache line corresponding to the amount of data that may be processed at one time, and the number of generated tasks may be reduced while improving use efficiency of a cache memory.
With the information processing apparatus 200, in a case where a node preceding the second node has a loop process, it is possible to determine the task granularity when the loop process of the second node is divided into the plurality of tasks such that the data access range is aligned with the preceding node. For example, in a case where a node preceding the second node has a loop process and all of the loop process has a dependency relationship with the preceding node, the information processing apparatus 200 determines a task granularity such that the data access range is aligned with the preceding node. With the information processing apparatus 200, it is possible to update the directed graph G by setting the determined task granularity and assigning dependency information based on a data access pattern to the second node.
Therefore, the information processing apparatus 200 aligns the data access range between the loop processes having the dependency relationship, thereby suppressing an increase in the dependency relationships between the tasks and achieving high speed.
With the information processing apparatus 200, it is possible to generate the directed graph G, based on a dependency relationship given by any of the flow dependency, the inverse flow dependency, and the output dependency between the statements in the program P.
Therefore, the information processing apparatus 200 may generate the directed graph G, based on the data dependency.
With the information processing apparatus 200, it is possible to output the program P after conversion (program P in the dependent task parallel description).
Therefore, the information processing apparatus 200 may pass the program P after conversion to a runtime of a compiler or transmit the program P after conversion to another computer (for example, an execution apparatus).
From these, with the information processing apparatus 200 according to Embodiment 2, it is possible to reduce an overhead by reducing the number of generated tasks while acquiring parallelism by setting the task having an appropriate granularity, and it is possible to improve performance of the HPC program.
The conversion method described in the present embodiment may be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The conversion program is recorded in a computer-readable recording medium such as a hard disk, a flexible disk, a CD-ROM, a DVD, or a USB memory, and is executed by being read from the recording medium by the computer. The conversion program may be distributed via a network such as the Internet.
The conversion apparatus 101 (information processing apparatus 200) described in the present embodiment may also be realized by an integrated circuit (IC) for specific application, such as a standard cell or a structured application-specific integrated circuit (ASIC), or by a programmable logic device (PLD), such as a field-programmable gate array (FPGA).
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---
2021-198907 | Dec 2021 | JP | national |