The disclosure claims priority to Chinese Patent Application No. 202211177797.3, filed with the State Intellectual Property Office of China on Sep. 27, 2022 and entitled “Instruction Execution Method and Apparatus for Graph Computation”, which is incorporated herein by reference in its entirety.
The disclosure relates to the technical field of computer systems based on specific computing models, in particular to an instruction execution method and apparatus for graph computation.
With neural network models put into practice in recent years, neural network compilation technology has become increasingly important. Existing computational graph compilation technology for neural network models does not analyze, from a global perspective, the dependency relationship among the instructions contained in nodes during the execution of a computational graph, and does not derive from that relationship a topological order of the instructions that can be executed in parallel in the global computational graph. As a result, compiling the neural network model consumes a great deal of memory, and the computational graph executes less efficiently when run on a computer.
By analyzing the dependency relationship among instructions during the execution of a computational graph and building a topological order of parallel instructions, the disclosure provides a method and apparatus for scheduling parallel instructions to hardware resources as quickly as possible, and provides a compilation technology for instruction execution in graph computation. The objective of the disclosure is to provide an instruction execution method and apparatus for graph computation that analyze, from a global perspective, the dependency relationship among the instructions contained in nodes during the execution of a computational graph, and derive from that relationship a topological order of the instructions that can be executed in parallel in the global computational graph, so that the parallel instructions are scheduled to hardware resources as quickly as possible.
The technical solutions adopted by the disclosure are as follows:
An instruction execution method for graph computation includes the following steps:
Further, the instruction dependency relationship in step S3 includes a write-read strong dependency relationship, a read-write weak dependency relationship and a write-write weak dependency relationship.
Further, the write-read strong dependency relationship is: writing a register first and then reading the same register according to instruction operations, where the instruction operation of reading the same register later depends on the instruction operation of writing the register first.
Further, the read-write weak dependency relationship is: reading a register first and then writing the same register according to instruction operations, where the instruction operation of writing the same register later depends on the instruction operation of reading the register first.
Further, the write-write weak dependency relationship is: writing a register first and then writing the same register according to instruction operations, where the instruction operation of writing the same register later depends on the instruction operation of writing the register first.
Further, the specific steps of step S4 are: traversing each node in turn according to the topological structure of the computational graph, and building dependency relationship edges of each node by analyzing the dependency relationship between each node instruction and successor node instructions thereof, to form the instruction dependency relationship graph.
Further, the specific steps of step S5 are: traversing each computing node in turn according to the topological structure of the computational graph, and obtaining parallel executable instructions in each step of the execution flow according to the instruction dependency relationship graph, to obtain the topological order of parallel instructions.
Further, the specific step of step S6 is: scheduling the parallel executable instructions in each step to the corresponding hardware resources according to the topological order of the instruction dependency relationship graph.
The disclosure further provides an instruction execution apparatus for graph computation, including a memory and one or more processors, the memory storing executable codes, and the one or more processors executing the executable codes to implement the instruction execution method for graph computation in any of the foregoing embodiments.
The disclosure further provides a computer-readable storage medium storing a program that, when executed by a processor, implements the instruction execution method for graph computation in any of the foregoing embodiments.
The beneficial effects of the disclosure are as follows: the disclosure analyzes, from a global perspective, the dependency relationship among the instructions contained in nodes during the execution of a computational graph, and derives from that relationship a topological order of the instructions that can be executed in parallel in the global computational graph, thereby providing a method and apparatus for scheduling the parallel instructions to hardware resources as quickly as possible. Analyzing and designing parallel computation operations improves the instruction execution efficiency of graph computation, and a compilation technology for instruction execution in graph computation is provided. When developing algorithm models, researchers and engineering users can use the optimization model of the instruction execution method and apparatus for graph computation to optimize the compilation efficiency of the computational graph, which promotes the practical deployment of neural network models.
The following description of at least one exemplary embodiment is in fact illustrative only, and is definitely not intended to limit the disclosure and the application or use thereof. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the disclosure without any creative effort fall within the scope of protection of the disclosure.
With reference to
Embodiment:
An instruction execution method for graph computation includes the following steps: See
See
The instruction dependency relationship includes a write-read strong dependency relationship, a read-write weak dependency relationship and a write-write weak dependency relationship;
Further, the write-read strong dependency relationship is: writing a register first and then reading the same register according to instruction operations, where the instruction operation of reading the same register later depends on the instruction operation of writing the register first;
Further, the read-write weak dependency relationship is: reading a register first and then writing the same register according to instruction operations, where the instruction operation of writing the same register later depends on the instruction operation of reading the register first;
Further, the write-write weak dependency relationship is: writing a register first and then writing the same register according to instruction operations, where the instruction operation of writing the same register later depends on the instruction operation of writing the register first.
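The three relationships defined above can be illustrated with a minimal sketch. The `Instr` record and the `classify` function are hypothetical names introduced only for illustration; the disclosure does not prescribe any data structure. The register sets used in the example are the ones stated for nodes V3, V4, V5 and V6 in this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: the registers it reads and writes."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def classify(first: Instr, second: Instr) -> list:
    """Return the dependency kinds that make `second` depend on `first`.

    Assumption: `first` precedes `second` in execution order.
    """
    kinds = []
    if first.writes & second.reads:
        # Write first, then read the same register.
        kinds.append("write-read (strong)")
    if first.reads & second.writes:
        # Read first, then write the same register.
        kinds.append("read-write (weak)")
    if first.writes & second.writes:
        # Write first, then write the same register.
        kinds.append("write-write (weak)")
    return kinds


# Register usage as described for the example nodes.
v3 = Instr("V3", reads=frozenset({"r1", "r2"}), writes=frozenset({"r1"}))
v4 = Instr("V4", writes=frozenset({"r2"}))
v5 = Instr("V5", writes=frozenset({"r3"}))
v6 = Instr("V6", reads=frozenset({"r2", "r3"}), writes=frozenset({"r2"}))
```

Under these assumptions, `classify(v3, v4)` reports only the read-write weak dependency on r2, and `classify(v5, v6)` reports only the write-read strong dependency on r3.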
Step S4: Build an instruction dependency relationship graph;
Each node is traversed in turn according to the topological structure of the computational graph, and dependency relationship edges of each node are built by analyzing the dependency relationship between each node instruction and successor node instructions thereof, to form the instruction dependency relationship graph;
Analyzing the dependency relationship between each node instruction and its successor node instructions means determining which of the following relationships holds between them: a write-read strong dependency relationship, a read-write weak dependency relationship or a write-write weak dependency relationship.
represents that the parallel instructions that can be executed simultaneously in step 1 include the instruction at the Vi node.
Node V1: node V1 contains a write register r1, and node V3 contains a read register r1, so node V1 and node V3 have a write-read strong dependency relationship between instructions.
Node V2: node V2 contains a write register r2, and node V3 contains a read register r2, so node V2 and node V3 have a write-read strong dependency relationship between instructions.
Node V3: 1) node V3 contains the read register r2, and node V4 contains the write register r2, so node V3 and node V4 have a read-write weak dependency relationship between instructions. 2) Node V3 contains the write register r1, and node V7 contains the read register r1, so node V3 and node V7 have a write-read strong dependency relationship between instructions.
Node V4: node V4 contains the write register r2, and node V6 contains the read register r2, so node V4 and node V6 have a write-read strong dependency relationship between instructions.
Node V5: node V5 contains a write register r3, and node V6 contains a read register r3, so node V5 and node V6 have a write-read strong dependency relationship between instructions.
Node V6: 1) node V6 contains the write register r2, and node V7 contains the read register r2, so node V6 and node V7 have a write-read strong dependency relationship between instructions. 2) Node V6 contains the read register r3, and node V9 contains the write register r3, so node V6 and node V9 have a read-write weak dependency relationship between instructions.
Node V7: node V7 contains the read register r2, and node V8 contains the write register r2, so node V7 and node V8 have a read-write weak dependency relationship between instructions.
Node V8: node V8 contains the write register r2, and node V10 contains the read register r2, so node V8 and node V10 have a write-read strong dependency relationship between instructions.
Node V9: node V9 contains the write register r3, and node V10 contains the read register r3, so node V9 and node V10 have a write-read strong dependency relationship between instructions.
Node V10: node V10 contains the write register r2, and node V11 contains the read register r2, so node V10 and node V11 have a write-read strong dependency relationship between instructions.
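The dependency edges enumerated for nodes V1 to V10 can be collected into a small graph structure, sketched below. The tuple layout, the `build_graph` helper and the `STRONG`/`WEAK` labels are illustrative assumptions; only the edges themselves are taken from the description above.

```python
from collections import defaultdict

STRONG, WEAK = "strong", "weak"

# (predecessor, successor, register, kind), as listed for nodes V1..V10.
EDGES = [
    ("V1", "V3", "r1", STRONG),    # write r1 -> read r1
    ("V2", "V3", "r2", STRONG),    # write r2 -> read r2
    ("V3", "V4", "r2", WEAK),      # read r2  -> write r2
    ("V3", "V7", "r1", STRONG),    # write r1 -> read r1
    ("V4", "V6", "r2", STRONG),    # write r2 -> read r2
    ("V5", "V6", "r3", STRONG),    # write r3 -> read r3
    ("V6", "V7", "r2", STRONG),    # write r2 -> read r2
    ("V6", "V9", "r3", WEAK),      # read r3  -> write r3
    ("V7", "V8", "r2", WEAK),      # read r2  -> write r2
    ("V8", "V10", "r2", STRONG),   # write r2 -> read r2
    ("V9", "V10", "r3", STRONG),   # write r3 -> read r3
    ("V10", "V11", "r2", STRONG),  # write r2 -> read r2
]


def build_graph(edges):
    """Return (successors, predecessors) adjacency maps for the edge list."""
    successors = defaultdict(list)
    predecessors = defaultdict(list)
    for src, dst, _register, _kind in edges:
        successors[src].append(dst)
        predecessors[dst].append(src)
    return dict(successors), dict(predecessors)


successors, predecessors = build_graph(EDGES)
```

In this sketch `predecessors["V6"]` is `["V4", "V5"]`, and nodes V1, V2 and V5 do not appear as keys of `predecessors` because no edge points to them.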
Step S5: Build a topological order of parallel instructions;
Each computing node is traversed in turn according to the topological structure of the computational graph, and parallel executable instructions in each step of the execution flow are obtained according to the instruction dependency relationship graph, to obtain the topological order of parallel instructions;
The parallel executable instructions in each step are determined as follows: when the current instruction to be analyzed is reached during running, if it has no precursor node on which it depends in the instruction dependency relationship graph, the current parallel executable instructions include the current instruction to be analyzed.
Parallel executable instructions in the first step: the instructions contained in node V1, node V2 and node V5, which have no dependency relationship, can be executed in parallel in the first step.
Parallel executable instruction in the second step: because node V3 depends on the instructions contained in node V1 and node V2, the instruction contained in node V3 can be executed in the second step. Node V6 depends on node V4 in addition to node V5, and node V4 depends on node V3, so node V6 and node V3 have an indirect dependency relationship, and the instruction contained in node V6 cannot be executed in the second step. It is finally concluded that the instruction contained in node V3 can be executed in parallel in the second step.
Parallel executable instruction in the third step: the nodes directly dependent on node V3 include V4 node and V7 node. In addition, node V4 depends only on node V3, so the instruction contained in node V4 can be executed in the third step. Node V7 depends on node V6 in addition to node V3, and node V6 depends on node V4, so node V7 and node V4 have an indirect dependency relationship, and the instruction contained in node V7 cannot be executed in the third step. It is finally concluded that the instruction contained in node V4 can be executed in parallel in the third step.
Parallel executable instruction in the fourth step: the nodes directly dependent on node V4 include only V6 node. Although node V6 depends on node V5 in addition to node V4, the instruction contained in node V5 has been executed in the first step, so it can be regarded as that node V6 depends only on node V4 in the fourth step. Therefore, the instruction contained in node V6 can be executed in the fourth step. It is finally concluded that the instruction contained in node V6 can be executed in parallel in the fourth step.
Parallel executable instructions in the fifth step: the nodes directly dependent on node V6 include V7 node and V9 node, and node V9 depends only on node V6. It is finally concluded that the instructions contained in node V7 and V9 can be executed in parallel in the fifth step.
Parallel executable instruction in the sixth step: the nodes directly dependent on node V7 include V8 node, the nodes directly dependent on node V9 include V10 node, but node V10 depends on node V8. It is finally concluded that the instruction contained in node V8 can be executed in parallel in the sixth step.
Parallel executable instruction in the seventh step: the nodes directly dependent on node V8 include node V10, node V10 also depends on node V9, but the instruction contained in node V9 has been executed in the fifth step. It is finally concluded that the instruction contained in node V10 can be executed in parallel in the seventh step.
Parallel executable instruction in the eighth step: the nodes directly dependent on node V10 include only V11 node. It is finally concluded that the instruction contained in node V11 can be executed in parallel in the eighth step.
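The eight steps derived above can be reproduced with a layered topological sort, sketched below. This is one possible way to compute the grouping, not necessarily the procedure used by the disclosure; `parallel_steps` is a hypothetical name, and the sort key assumes node names of the form `V<number>`.

```python
# Dependency edges (predecessor, successor) of the example graph.
EDGES = [
    ("V1", "V3"), ("V2", "V3"), ("V3", "V4"), ("V3", "V7"),
    ("V4", "V6"), ("V5", "V6"), ("V6", "V7"), ("V6", "V9"),
    ("V7", "V8"), ("V8", "V10"), ("V9", "V10"), ("V10", "V11"),
]


def parallel_steps(edges):
    """Group nodes into steps; a node is ready once all its precursors ran."""
    nodes = {n for edge in edges for n in edge}
    precursors = {n: set() for n in nodes}
    for src, dst in edges:
        precursors[dst].add(src)

    steps, done = [], set()
    while len(done) < len(nodes):
        # Ready: not yet executed and every precursor already executed.
        ready = [n for n in nodes if n not in done and precursors[n] <= done]
        if not ready:
            raise ValueError("dependency cycle detected")
        ready.sort(key=lambda n: int(n[1:]))  # assumes names like "V7"
        steps.append(ready)
        done.update(ready)
    return steps


STEPS = parallel_steps(EDGES)
```

For the example graph this yields V1, V2, V5 in the first step, then V3, V4 and V6 in single-instruction steps, V7 and V9 together in the fifth step, and V8, V10, V11 in the last three steps, matching the analysis above.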
Step S6: Schedule the parallel instructions to hardware resources;
According to the topological order of the instruction dependency relationship graph, the parallel executable instructions in each step are scheduled to the corresponding hardware resources;
The parallel executable instructions in each step are scheduled to the corresponding hardware resources, where the data loading instructions LD and data storage instructions ST, which handle data, are scheduled to a memory unit, and instructions for arithmetic operations are scheduled to an arithmetic logic unit. Scheduling instructions to hardware resources means that the parallel instructions in each step are scheduled to the earliest position at which they can begin execution on the corresponding hardware resources. Considering that the resources related to a hardware memory port are in use by the instruction contained in a precursor node on which the current instruction depends, the earliest position at which execution can begin is the position where the execution of the instruction contained in that precursor node ends in the topological graph of the instruction dependency relationship.
Schedule the parallel instructions in the first step: the scheduling of the parallel instructions in the first step includes the following process: 1) the parallel instructions in the first step include instructions contained in node V1, node V2 and node V5, and the instructions are all data handling instructions, so the instructions contained in node V1, node V2 and node V5 are scheduled to the memory unit. 2) The instructions contained in node V1, node V2 and node V5 are scheduled to a position where the execution begins in the memory unit at the earliest, that is, the initial position of the memory unit, such as the position identified by symbol ① in the arithmetic logic unit in
Schedule the parallel instruction in the second step: the scheduling of the parallel instruction in the second step includes the following process: 1) the parallel instruction in the second step includes the instruction contained in node V3, and the instruction is an arithmetic operation instruction, so the instruction contained in node V3 is scheduled to the arithmetic logic unit. 2) The instruction contained in node V3 is scheduled to a position where the execution begins in the arithmetic logic unit at the earliest, such as the position identified by symbol ② in the arithmetic logic unit in
Schedule the parallel instruction in the third step: the scheduling of the parallel instruction in the third step includes the following process: 1) the parallel instruction in the third step includes the instruction contained in node V4, and the instruction is a data handling instruction, so the instruction contained in node V4 is scheduled to the memory unit. 2) The instruction contained in node V4 is scheduled to a position where the execution begins in the memory unit at the earliest, such as the position identified by symbol ③ in the arithmetic logic unit in
Schedule the parallel instruction in the fourth step: the scheduling of the parallel instruction in the fourth step includes the following process: 1) the parallel instruction in the fourth step includes the instruction contained in node V6, and the instruction is an arithmetic operation instruction, so the instruction contained in node V6 is scheduled to the arithmetic logic unit. 2) The instruction contained in node V6 is scheduled to a position where the execution begins in the arithmetic logic unit at the earliest, such as the position identified by symbol ④ in the arithmetic logic unit in
Schedule the parallel instructions in the fifth step: the scheduling of the parallel instructions in the fifth step includes the following process: 1) the parallel instructions in the fifth step include instructions contained in node V7 and node V9, the instruction contained in node V9 is a data handling instruction, and the instruction contained in node V7 is an arithmetic operation instruction, so the instruction contained in node V9 is scheduled to the memory unit, and the instruction contained in node V7 is scheduled to the arithmetic logic unit. 2) The instruction contained in node V9 is scheduled to a position where the execution begins in the memory unit at the earliest, such as the position identified by symbol ⑤ in the arithmetic logic unit in
Schedule the parallel instruction in the sixth step: the scheduling of the parallel instruction in the sixth step includes the following process: 1) the parallel instruction in the sixth step includes the instruction contained in node V8, and the instruction is a data handling instruction, so the instruction contained in node V8 is scheduled to the memory unit. 2) The instruction contained in node V8 is scheduled to a position where the execution begins in the memory unit at the earliest, such as the position identified by symbol ⑥ in the arithmetic logic unit in
Schedule the parallel instruction in the seventh step: the scheduling of the parallel instruction in the seventh step includes the following process: 1) the parallel instruction in the seventh step includes the instruction contained in node V10, and the instruction is an arithmetic operation instruction, so the instruction contained in node V10 is scheduled to the arithmetic logic unit. 2) The instruction contained in node V10 is scheduled to a position where the execution begins in the arithmetic logic unit at the earliest, such as the position identified by symbol ⑦ in the arithmetic logic unit in
Schedule the parallel instruction in the eighth step: the scheduling of the parallel instruction in the eighth step includes the following process: 1) the parallel instruction in the eighth step includes the instruction contained in node V11, and the instruction is an arithmetic operation instruction, so the instruction contained in node V11 is scheduled to the arithmetic logic unit. 2) The instruction contained in node V11 is scheduled to a position where the execution begins in the arithmetic logic unit at the earliest, such as the position identified by symbol ⑧ in the arithmetic logic unit in
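The assignment of each step's instructions to hardware units can be sketched as a simple dispatch on the opcode. The opcodes below are the ones named for each node in the description of step S7; the function names `unit_for` and `schedule` and the dictionary layout are assumptions made for illustration only.

```python
# Opcode of the instruction contained in each node (from the step S7 text).
OPCODES = {
    "V1": "LD", "V2": "LD", "V3": "SUB", "V4": "LD", "V5": "LD",
    "V6": "MUL", "V7": "ADD", "V8": "LD", "V9": "LD",
    "V10": "ADD", "V11": "SUB",
}

# Parallel executable instructions per step, as derived in step S5.
STEPS = [
    ["V1", "V2", "V5"], ["V3"], ["V4"], ["V6"],
    ["V7", "V9"], ["V8"], ["V10"], ["V11"],
]

MEMORY_OPS = {"LD", "ST"}  # data handling instructions


def unit_for(opcode):
    """Data handling goes to the memory unit, everything else to the ALU."""
    return "memory unit" if opcode in MEMORY_OPS else "arithmetic logic unit"


def schedule(steps, opcodes):
    """Return, for each step number, the nodes placed on each hardware unit."""
    plan = []
    for number, nodes in enumerate(steps, start=1):
        slots = {"memory unit": [], "arithmetic logic unit": []}
        for node in nodes:
            slots[unit_for(opcodes[node])].append(node)
        plan.append((number, slots))
    return plan


PLAN = schedule(STEPS, OPCODES)
```

In this sketch the fifth step places V9 on the memory unit and V7 on the arithmetic logic unit, which is the only step of the example that uses both units at once.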
Step S7: Build shortest schedules for the parallel instructions: the shortest time required to execute the parallel instructions under the condition of limited hardware resources;
Building the shortest schedules for the parallel instructions means determining the shortest time required to execute the parallel instructions under the condition of limited hardware resources. It is assumed that every instruction operation requires one clock cycle, with the exception of the data loading instruction LD, which requires two clock cycles. For the situation of loading first and then storing immediately, the hardware resources cache the data to be loaded in a temporary table, and the data are stored from the temporary table to memory resources when the data storage instruction needs to be executed; the data storage instruction ST at the same storage position can therefore be executed in the clock cycle following the start of the data loading instruction LD at that position. In the process of building the shortest schedules for the parallel instructions, because each data handling instruction occupies a hardware memory port during execution, when a plurality of data handling instructions need to be executed in parallel, only one data handling instruction can be executed at a time, and the order of execution gives priority to the instructions that can be executed earliest in the topological graph of the instruction dependency relationship.
The building of the shortest schedules for the parallel instructions includes the following process:
Shortest schedule for the parallel instructions in the first step: the parallel instructions in the first step include data loading instructions LD contained in node V1, node V2 and node V5 among the data handling instructions, and the execution time for each data loading instruction needs two clock cycles, so according to the order principle of instructions that can be executed at the earliest in the topological graph of the instruction dependency relationship, the data loading instructions LD contained in node V1, node V2 and node V5 are sequentially executed, which takes a total of 6 clock cycles.
Shortest schedule for the parallel instruction in the second step: because the parallel instruction in the second step includes an arithmetic operation instruction SUB contained in node V3, it takes a total of 1 clock cycle to execute the operation.
Shortest schedule for the parallel instruction in the third step: because the parallel instruction in the third step includes a data loading instruction LD contained in node V4 among the data handling instructions, it takes a total of 2 clock cycles to execute the operation.
Shortest schedule for the parallel instruction in the fourth step: because the parallel instruction in the fourth step includes an arithmetic operation instruction MUL contained in node V6, it takes a total of 1 clock cycle to execute the operation.
Shortest schedule for the parallel instructions in the fifth step: because the parallel instructions in the fifth step include an arithmetic operation instruction ADD contained in node V7 and a data loading instruction LD contained in node V9 among the data handling instructions, the ADD instruction contained in node V7 and the data loading instruction LD contained in node V9 can be executed simultaneously, which takes 1 clock cycle to execute the ADD instruction contained in node V7 and 2 clock cycles to execute the data loading instruction LD contained in node V9. Therefore, this operation needs a total of 2 clock cycles.
Shortest schedule for the parallel instruction in the sixth step: because the parallel instruction in the sixth step includes a data loading instruction LD contained in node V8 among the data handling instructions, it takes a total of 2 clock cycles to execute the operation.
Shortest schedule for the parallel instruction in the seventh step: because the parallel instruction in the seventh step includes an arithmetic operation instruction ADD contained in node V10, it takes a total of 1 clock cycle to execute the operation.
Shortest schedule for the parallel instruction in the eighth step: because the parallel instruction in the eighth step includes an arithmetic operation instruction SUB contained in node V11, it takes a total of 1 clock cycle to execute the operation.
The time required to execute the entire topological graph of the instruction dependency relationship is an accumulation of times required for the shortest schedules for the parallel instructions in the above steps. Therefore, the time required to execute the entire topological graph of the instruction dependency relationship is 6+1+2+1+2+2+1+1, that is, it takes a total of 16 clock cycles to execute the topological graph, as shown in
Corresponding symbol meanings in
ⓒ: a represents that the execution of parallel instructions in step c requires a clock cycles, such as ①: 6 represents that the execution of parallel instructions in the first step requires 6 clock cycles.
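The 16-cycle total can be checked with a short calculation, sketched below under the stated latencies (LD takes two clock cycles, every other instruction one) and the single memory port, so that data handling instructions within a step run one after another. Treating arithmetic instructions within a step as sharing one arithmetic logic unit is an additional assumption of this sketch; the example never places more than one arithmetic instruction in a step, so it does not affect the result. The names `latency` and `step_cycles` are hypothetical.

```python
OPCODES = {
    "V1": "LD", "V2": "LD", "V3": "SUB", "V4": "LD", "V5": "LD",
    "V6": "MUL", "V7": "ADD", "V8": "LD", "V9": "LD",
    "V10": "ADD", "V11": "SUB",
}
STEPS = [
    ["V1", "V2", "V5"], ["V3"], ["V4"], ["V6"],
    ["V7", "V9"], ["V8"], ["V10"], ["V11"],
]
MEMORY_OPS = {"LD", "ST"}


def latency(opcode):
    """LD takes two clock cycles; every other instruction takes one."""
    return 2 if opcode == "LD" else 1


def step_cycles(nodes, opcodes):
    """Cycles for one step: each unit runs its instructions serially,
    and the memory unit and the ALU work at the same time."""
    memory = sum(latency(opcodes[n]) for n in nodes
                 if opcodes[n] in MEMORY_OPS)
    alu = sum(latency(opcodes[n]) for n in nodes
              if opcodes[n] not in MEMORY_OPS)
    return max(memory, alu)


PER_STEP = [step_cycles(nodes, OPCODES) for nodes in STEPS]
TOTAL = sum(PER_STEP)
```

With the example graph, `PER_STEP` is `[6, 1, 2, 1, 2, 2, 1, 1]` and `TOTAL` is 16, in agreement with the accumulation 6+1+2+1+2+2+1+1 given above.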
Step S8: Release the completed instructions.
The method as stated above analyzes, from a global perspective, the dependency relationship among the instructions contained in nodes during the execution of a computational graph, and derives from that relationship a topological order of the instructions that can be executed in parallel in the global computational graph, thereby providing a method and apparatus for scheduling the parallel instructions to hardware resources as quickly as possible. Analyzing and designing parallel computation operations improves the instruction execution efficiency of graph computation, and a compilation technology for instruction execution in graph computation is provided. When developing algorithm models, researchers and engineering users can use the optimization model of the instruction execution method and apparatus for graph computation to optimize the compilation efficiency of the computational graph, which promotes the practical deployment of neural network models.
Corresponding to the foregoing embodiment of the instruction execution method for graph computation, the disclosure further provides an embodiment of an instruction execution apparatus for graph computation.
With reference to
The embodiment of the instruction execution apparatus for graph computation according to the disclosure may be applied to any device having data processing capability, which may be a device or apparatus such as a computer. The embodiment of the apparatus can be implemented by software, hardware, or by a combination of hardware and software. Taking the software implementation as an example, the logical apparatus is formed by reading corresponding computer program instructions in a non-volatile memory into a memory through a processor of any device having data processing capability where the apparatus is located. From the hardware level, as shown in
The implementation processes of the functions and effects of the units in the foregoing apparatus are detailed in the implementation processes of the corresponding steps in the foregoing method, and details are not described herein again.
The embodiment of the apparatus substantially corresponds to the embodiment of the method, so for relevant parts reference may be made to the embodiment of the method. The apparatus examples described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the disclosure. Those of ordinary skill in the art can understand and implement the solutions without any creative effort.
An embodiment of the disclosure further provides a computer-readable storage medium storing a program that, when executed by a processor, implements the instruction execution method for graph computation in the foregoing embodiment.
The computer-readable storage medium may be an internal storage unit of any device having data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of any device having data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card equipped on the device. Further, the computer-readable storage medium may include both an internal storage unit of any device with data processing capability and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or will be output.
Described above are only the preferred embodiments of the disclosure, and are not intended to limit the disclosure. The disclosure may have various modifications and variations for those skilled in the art. Any modification, equivalent substitution or improvement made within the spirit and principle of the disclosure shall fall into the protection scope of the disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202211177797.3 | Sep 2022 | CN | national |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2022/124006 | Oct 2022 | US |
Child | 18071978 | | US |