This application claims priority to Chinese patent application No. 202211177796.9, filed with the China National Intellectual Property Administration on Sep. 27, 2022, the disclosure of which is incorporated by reference herein in its entirety.
The disclosure relates to the technical field of computer systems based on specific computation models, in particular to an optimization method and apparatus for compiling a computation graph.
With the widespread adoption of neural network models in recent years, neural-network-oriented compiling technology has become increasingly important. A neural network model built with deep learning technology requires a large amount of training data, which places a heavy processing burden on the computer system running the model. Existing computation graph compiling technologies have not analyzed, from a global perspective, the constraint relationship between nodes in the execution process of a computation graph, nor have they analyzed, on the basis of that constraint relationship, the dynamic changes of the life cycles of the tensor variables contained in the nodes of the computation graph across the different states of the execution process. As a result, the computer consumes more memory resources when processing tasks containing a neural network model, and the register resources of the CPU are also strained.
To this end, the disclosure proposes to abstract the dynamic change process of node states in the execution process of the computation graph into a constraint-based set representation, and provides an intermediate representation technology based on sets of nodes containing tensor variables.
In order to solve the above technical problems, the disclosure provides an optimization method and apparatus for compiling a computation graph.
The technical solution adopted by the disclosure is as follows:
An optimization method for compiling a computation graph includes the following steps:
Further, converting the computation graph into the intermediate representation specifically includes the following sub-steps:
Further, analyzing the dependency relationship includes: analyzing and deducing a relationship among the input node sets of the various nodes of the computation graph.
Further, constructing and saving the work stack: traversing the computation graph according to a topological order, and pushing the nodes in the computation graph onto the work stack in sequence.
Further, initializing the node elements includes: initializing all the nodes of the computation graph that have not been executed to be in a nonactivated state.
Further, popping out the stack top node element from the work stack includes the following sub-steps:
Further, implementing the intermediate representation in the fixed node state includes: when the input node sets of the various nodes in the intermediate representation of the computation graph reach the fixed node state, mapping the node elements contained to 1, and mapping the other node elements to 0.
Further, allocating the registers for the tensor variables includes: allocating idle registers to the tensor variables contained in the nodes whose contained node elements are mapped to 1 when the input node set reaches the fixed node state.
The disclosure further provides an optimization apparatus for compiling a computation graph, including a memory and one or more processors, wherein the memory stores executable instructions, and the one or more processors execute the executable instructions to implement the optimization method for compiling a computation graph according to any one of the above embodiments.
The disclosure further provides a computer-readable storage medium storing a program which, when executed by a processor, implements the optimization method for compiling a computation graph according to any one of the above embodiments.
The beneficial effects of the disclosure are as follows: The disclosure discloses an optimization method and apparatus for compiling a computation graph. The disclosure proposes converting a computation graph into an intermediate representation based on sets of nodes containing tensor variables, provides a method for analyzing how the nodes of the intermediate representation are dynamically executed to a fixed node state, and optimizes the allocation of idle registers to the tensor variables contained in the various nodes of the intermediate representation in the fixed node state. Through the above optimization method for compiling a computation graph, a work stack for the intermediate representation is used to improve the utilization efficiency of compiling memory, reduce the amount of computer memory resources required for running a neural network model, save the register resources of CPU cores that need to be allocated when the neural network model is running on the computer, and finally improve the data training efficiency and the data input and output efficiency of the neural network model. The optimization method for compiling a computation graph of the disclosure improves the execution efficiency of the computation graph at runtime. In the process of developing algorithm models, researchers and engineering users can use the optimization method and apparatus for compiling a computation graph to optimize their models, thus improving the compiling efficiency of the computation graph and promoting the implementation and application of neural network models.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure and its application or uses. Based on the embodiments in the disclosure, all other embodiments obtained by those of ordinary skill in the art without doing creative work shall fall within the protection scope of the disclosure.
Referring to the accompanying drawings, an optimization method for compiling a computation graph includes the following steps:
Referring to the accompanying drawings, the sets of the nodes containing the tensor variables of the intermediate representation are acquired by iteratively deducing the tensor variables contained in each node, until the input node sets and the output node sets of all the nodes no longer change, i.e., until the node elements contained in all the sets are fixed nodes. The iteration process is as follows:
Referring to the accompanying drawings, in the first round of iteration, the input node set and the output node set of each node change as follows:
Referring to the accompanying drawings, in the second round of iteration, the input node set and the output node set of each node change as follows:
The set representation of the node V2 is denoted as:
V2_IN = V1_OUT = {0x, 2y, 4x, 7x, 8z},
V2_OUT = V2_IN = {0x, 2y, 4x, 7x, 8z};
The set representation of the node V3 is denoted as:
V3_IN = V2_OUT = {0x, 2y, 4x, 7x, 8z},
V3_OUT = V3_IN = {0x, 2y, 4x, 7x, 8z};
The set representation of the node V4 is denoted as:
V4_IN = V3_OUT = {0x, 2y, 4x, 7x, 8z},
V4_OUT = (V3_IN \ {0x, 4x, 7x}) ∪ {4x} = {2y, 4x, 8z};
The set representation of the node V5 is denoted as:
V5_IN = V3_OUT ∪ V4_OUT = {0x, 2y, 4x, 7x, 8z},
V5_OUT = (V5_IN \ {8z}) ∪ {5z} = {0x, 2y, 4x, 7x, 5z};
The set representation of the node V6 is denoted as:
V6_IN = V5_OUT = {0x, 2y, 4x, 7x, 5z},
V6_OUT = V6_IN = {0x, 2y, 4x, 7x, 5z};
The set representation of the node V7 is denoted as:
V7_IN = V6_OUT = {0x, 2y, 4x, 7x, 5z},
V7_OUT = (V7_IN \ {0x, 4x, 7x}) ∪ {7x} = {2y, 5z, 7x};
The set representation of the node V8 is denoted as:
V8_IN = V6_OUT ∪ V7_OUT = {0x, 2y, 4x, 7x, 5z},
V8_OUT = (V8_IN \ {5z}) ∪ {8z} = {0x, 2y, 4x, 7x, 8z};
The set representation of the node V9 is denoted as V9_IN=V1_OUT={0x, 2y, 4x, 7x, 8z}.
Through the above two rounds of iteration, the node elements contained in the sets of the nodes containing the tensor variables of the intermediate representation no longer change, i.e., they reach fixed nodes. The set of the fixed nodes is defined as the intermediate representation based on the sets of the nodes containing the tensor variables.
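The iteration above is a standard fixed-point computation over the IN and OUT sets. The following is a minimal Python sketch of such a deduction, offered purely for illustration; the `preds`, `gen`, and `kill` mappings and all names are assumptions made for the sketch, not part of the disclosed method.

```python
# Illustrative sketch of the fixed-point deduction of the input/output node
# sets. The graph shape and the gen/kill sets are assumed for illustration.

def solve_fixed_point(preds, gen, kill):
    """Iterate IN/OUT sets until no set changes, i.e. the fixed node state.

    preds: node -> list of predecessor nodes
    gen:   node -> set of tensor-variable elements the node defines
    kill:  node -> set of tensor-variable elements the node invalidates
    """
    in_sets = {v: set() for v in preds}
    out_sets = {v: set() for v in preds}
    changed = True
    while changed:                        # one pass per round of iteration
        changed = False
        for v in preds:                   # e.g. in topological order
            # the input node set is the union of the predecessors' output sets
            new_in = set().union(*[out_sets[p] for p in preds[v]])
            # OUT(v) = (IN(v) \ KILL(v)) ∪ GEN(v), as in the V4/V5/V7/V8 rules
            new_out = (new_in - kill[v]) | gen[v]
            if new_in != in_sets[v] or new_out != out_sets[v]:
                in_sets[v], out_sets[v] = new_in, new_out
                changed = True
    return in_sets, out_sets
```

Under this reading, the two rounds of iteration above would correspond to two passes of the while loop, after which no IN or OUT set changes.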
Referring to the accompanying drawings, the output node sets of the various nodes can be represented by the input node sets, so that only the relationship among the input node sets of the various nodes needs to be deduced.
A process of deducing the relationship among the input node sets of the various nodes of the computation graph shown in the accompanying drawings is as follows:
V1_IN = { };
V2_IN = V1_IN ∪ (V3_IN \ {V3, V5, V6}) ∪ {V3};
V3_IN = V2_IN;
V4_IN = V1_IN ∪ (V5_IN \ {V3, V5, V6}) ∪ {V5};
V5_IN = V4_IN;
V6_IN = V2_IN ∪ V4_IN.
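For illustration, these six equations can be written down directly as one update function per input node set, together with the reverse table of which node sets depend on which. This is a hedged Python sketch; the lambda encoding and the `dependents` table are assumptions made for the example, not the disclosed representation.

```python
# The six dependency equations above, one update function per input node set,
# plus the reverse dependency table later used by the work stack.
equations = {
    "V1_IN": lambda s: frozenset(),
    "V2_IN": lambda s: s["V1_IN"] | (s["V3_IN"] - {"V3", "V5", "V6"}) | {"V3"},
    "V3_IN": lambda s: s["V2_IN"],
    "V4_IN": lambda s: s["V1_IN"] | (s["V5_IN"] - {"V3", "V5", "V6"}) | {"V5"},
    "V5_IN": lambda s: s["V4_IN"],
    "V6_IN": lambda s: s["V2_IN"] | s["V4_IN"],
}

# which node sets must be revisited when a given node set changes
dependents = {
    "V1_IN": ["V2_IN", "V4_IN"],
    "V2_IN": ["V3_IN", "V6_IN"],
    "V3_IN": ["V2_IN"],
    "V4_IN": ["V5_IN", "V6_IN"],
    "V5_IN": ["V4_IN"],
    "V6_IN": [],
}
```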
A work stack of a node to be processed is constructed and saved:
The node elements contained in the work stack are initialized to be in a nonactivated state:
A stack top node element is popped out from the work stack, an input node set of the stack top node element is deduced by using the dependency relationship, and the input node set of the stack top node element obtained in a current round of iteration is updated:
The node elements that depend on the node popped out in the last step are added to the stack top position in sequence, the current work stack is updated, and the last step is repeated until the work stack is empty:
The last four steps include the following processes of iteratively deducing the fixed node sets based on the nodes containing the tensor variables:
In a first step, a work stack of nodes to be processed is constructed and saved. The saved work stack is constructed as [V1_IN, V2_IN, V3_IN, V4_IN, V5_IN, V6_IN].
In a second step, the node elements contained in the work stack are initialized to be in a nonactivated state. Table 1 shows the states of the input node sets of the various nodes in the work stack.
In a third step, an element at a stack top of the work stack is processed. The processing of an element at a stack top of the work stack includes the following processes:
First, a stack top node element V1_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V1_IN of the work stack pops out of the stack. Since the input node set of a node V1_IN is an empty set, the node V1_IN is updated from the nonactivated state to an empty set state { }.
Second, node sets that depend on the popped-out node V1_IN are added to the work stack. The process of adding the node sets that depend on the popped-out node V1_IN to the work stack is as follows: since the sets that depend on the node V1_IN contain the node V2_IN and the node V4_IN, a dependent node set {V2_IN, V4_IN} is added to the stack top. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 2.
Third, a stack top node element V2_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that stack top node element V2_IN of the work stack pops out of the stack, and V2_IN={V3} is deduced according to V2_IN=V1_IN∪(V3_IN\{V3, V5, V6})∪{V3} and V1_IN={ }. Therefore, the node V2_IN is updated from the nonactivated state to the state {V3}.
Fourth, node sets that depend on the popped-out node V2_IN are added to the work stack. The process of adding the node sets that depend on the popped-out node V2_IN to the work stack is as follows: since the sets that depend on the node V2_IN contain the node V3_IN and the node V6_IN, a dependent node set {V3_IN, V6_IN} is added to the stack top. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 3.
Fifth, a stack top node element V3_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V3_IN of the work stack pops out of the stack, and V3_IN={V3} is deduced according to V3_IN=V2_IN=V1_IN∪(V3_IN\{V3, V5, V6})∪{V3} and V1_IN={ }. Therefore, the node V3_IN is updated from the nonactivated state to the state {V3}.
Sixth, node sets that depend on the popped-out node V3_IN are added to the work stack. The process of adding the node sets that depend on the popped-out node V3_IN to the work stack is as follows: since the sets that depend on the node V3_IN contain the node V2_IN, the dependent node set {V2_IN} is added to the stack top. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 4.
Seventh, a stack top node element V2_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V2_IN of the work stack pops out of the stack, and V2_IN={V3} is deduced according to V2_IN=V1_IN∪(V3_IN\{V3, V5, V6})∪{V3} and V1_IN={ }. Since the set elements of the node V2_IN do not change, the node V2_IN is kept in an activated state {V3}.
Eighth, node sets that depend on the popped-out node V2_IN are added to the work stack. Since the set elements of the node V2_IN do not change, no node sets that depend on the node V2_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 5.
Ninth, a stack top node element V6_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V6_IN of the work stack pops out of the stack, and V6_IN={V3} is deduced according to V6_IN=V2_IN∪V4_IN and V2_IN={V3}. Therefore, the node V6_IN is updated from the nonactivated state to the state {V3}.
Tenth, node sets that depend on the popped-out node V6_IN are added to the work stack. Since there are no other nodes that depend on the node V6_IN, no node sets that depend on the node V6_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 6.
Eleventh, a stack top node element V4_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V4_IN of the work stack pops out of the stack, and V4_IN={V5} is deduced according to V4_IN=V1_IN∪(V5_IN\{V3, V5, V6})∪{V5} and V1_IN={ }. Therefore, the node V4_IN is updated from the nonactivated state to the state {V5}.
Twelfth, node sets that depend on the popped-out node V4_IN are added to the work stack. Since the set that depends on the node V4_IN contains the node V5_IN and the node V6_IN, a node set {V5_IN, V6_IN} is added to the stack top. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 7.
Thirteenth, a stack top node element V5_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V5_IN of the work stack pops out of the stack, and V5_IN={V5} is deduced according to V5_IN=V4_IN={V5}. Therefore, the node V5_IN is updated from the nonactivated state to the state {V5}.
Fourteenth, node sets that depend on the popped-out node V5_IN are added to the work stack. Since the set that depends on the node V5_IN contains the node V4_IN, a node set {V4_IN} is added to the stack top. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 8.
Fifteenth, a stack top node element V4_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V4_IN of the work stack pops out of the stack, and V4_IN={V5} is deduced according to V4_IN=V1_IN∪(V5_IN\{V3, V5, V6})∪{V5} and V1_IN={ }. Therefore, the node V4_IN is kept in an activated state {V5}.
Sixteenth, node sets that depend on the popped-out node V4_IN are added to the work stack. Since the set elements of the node V4_IN do not change, no node sets that depend on the node V4_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 9.
Seventeenth, a stack top node element V6_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V6_IN of the work stack pops out of the stack, and V6_IN={V3, V5} is deduced according to V6_IN=V2_IN∪V4_IN. Therefore, the node V6_IN is updated from the activated state to the state {V3, V5}.
Eighteenth, node sets that depend on the popped-out node V6_IN are added to the work stack. Since there are no other nodes that depend on the node V6_IN, no node sets that depend on the node V6_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 10.
Nineteenth, a stack top node element V4_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V4_IN of the work stack pops out of the stack, and V4_IN={V5} is deduced according to V4_IN=V1_IN∪(V5_IN\{V3, V5, V6})∪{V5} and V1_IN={ }. Therefore, the node V4_IN is kept in an activated state {V5}.
Twentieth, node sets that depend on the popped-out node V4_IN are added to the work stack. Since the set elements of the node V4_IN do not change, no node sets that depend on the node V4_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 11.
Twenty-first, a stack top node element V2_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V2_IN of the work stack pops out of the stack, and V2_IN={V3} is deduced according to V2_IN=V1_IN∪(V3_IN\{V3, V5, V6})∪{V3} and V1_IN={ }. Since the set elements of the node V2_IN do not change, the node V2_IN is kept in an activated state {V3}.
Twenty-second, node sets that depend on the popped-out node V2_IN are added to the work stack. Since the set elements of the node V2_IN do not change, no node sets that depend on node V2_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 12.
Twenty-third, a stack top node element V3_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V3_IN of the work stack pops out of the stack, and V3_IN={V3} is deduced according to V3_IN=V2_IN=V1_IN∪(V3_IN\{V3, V5, V6})∪{V3} and V1_IN={ }. Since the set elements of the node V3_IN do not change, the node V3_IN is kept in the activated state {V3}.
Twenty-fourth, node sets that depend on the popped-out node V3_IN are added to the work stack. The process of adding the node sets that depend on the popped-out node V3_IN to the work stack is as follows: since the set elements of the node V3_IN do not change, no node sets that depend on node V3_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 13.
Twenty-fifth, a stack top node element V4_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V4_IN of the work stack pops out of the stack, and, V4_IN={V5} is deduced according to V4_IN=V1_IN∪(V5_IN\{V3, V5, V6})∪{V5} and V1_IN={ }. Therefore, the node V4_IN is kept in an activated state {V5}.
Twenty-sixth, node sets that depend on the popped-out node V4_IN are added to the work stack. Since the set elements of the node V4_IN do not change, no node sets that depend on the node V4_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 14.
Twenty-seventh, a stack top node element V5_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V5_IN of the work stack pops out of the stack, and V5_IN={V5} is deduced according to V5_IN=V4_IN and V4_IN={V5}. Since the set elements of the node V5_IN do not change, the node V5_IN is kept in the activated state {V5}.
Twenty-eighth, node sets that depend on the popped-out node V5_IN are added to the work stack. Since the set elements of the node V5_IN do not change, no node sets that depend on the node V5_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 15.
Twenty-ninth, a stack top node element V6_IN pops out of the work stack. The stack top node element pops out of the work stack, which refers to that the stack top node element V6_IN of the work stack pops out of the stack, and V6_IN={V3, V5} is deduced according to V6_IN=V2_IN∪V4_IN, V2_IN={V3}, and V4_IN={V5}. Since the set elements of the node V6_IN do not change, the node V6_IN is kept in the activated state {V3, V5}.
Thirtieth, node sets that depend on the popped-out node V6_IN are added to the work stack. Since there are no other nodes that depend on the node V6_IN, no node sets that depend on the node V6_IN are added to the work stack. After the above steps, the states of the input node sets of the various nodes in the work stack are updated as those shown in Table 16.
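The thirty steps traced in Tables 1 to 16 follow a single loop. Below is a Python sketch of that loop, under the assumption of the `equations` and `dependents` mappings from the earlier sketch; the order in which dependent node sets are re-added to the stack top is an assumption, so the exact pop sequence may differ from the tables while still reaching the same fixed node sets.

```python
# Work-stack iteration: pop the stack top node element, deduce its input node
# set from the dependency equations, and re-add its dependents whenever the
# popped element leaves the nonactivated state or its set elements change.

def run_work_stack(equations, dependents, order):
    values = {v: frozenset() for v in order}  # working values while deducing
    activated = set()                         # nodes no longer nonactivated
    stack = list(reversed(order))             # push nodes in topological order
    while stack:
        v = stack.pop()                       # pop the stack top node element
        new_value = equations[v](values)
        if v not in activated or new_value != values[v]:
            values[v] = new_value
            activated.add(v)
            stack.extend(dependents[v])       # dependents go to the stack top
    return values

fixed = run_work_stack(
    equations, dependents,
    ["V1_IN", "V2_IN", "V3_IN", "V4_IN", "V5_IN", "V6_IN"],
)
# e.g. fixed["V6_IN"] == frozenset({"V3", "V5"}), matching Table 16
```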
An intermediate representation in a fixed node state is implemented by using a bit vector:
Referring to Table 17, a bit vector representation of the intermediate representation in the fixed node state is shown.
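As a sketch of how such a bit vector might be encoded, the following Python fragment maps a fixed set over an assumed universe of node elements; the ordering of the universe is an assumption made for illustration.

```python
# Encode a fixed input node set as a bit vector: contained node elements map
# to 1, all other node elements map to 0. The universe ordering is assumed.
UNIVERSE = ["V1", "V2", "V3", "V4", "V5", "V6"]

def to_bit_vector(node_set, universe=UNIVERSE):
    return [1 if element in node_set else 0 for element in universe]

# For the fixed set V6_IN = {V3, V5}:
# to_bit_vector({"V3", "V5"}) == [0, 0, 1, 0, 1, 0]
```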
Registers are allocated to the tensor variables contained in the nodes of the intermediate representation that has reached the fixed node state.
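A minimal sketch of this allocation step follows, assuming a simple pool of idle registers and a hypothetical `variables_of` mapping from node elements to the tensor variables they contain; the disclosure does not specify a particular allocation policy, so first-fit is used here purely for illustration.

```python
# First-fit sketch: walk the nodes whose bit-vector entries are 1 in the
# fixed node state and hand each of their tensor variables an idle register.
# The register pool and the variables_of mapping are illustrative assumptions.

def allocate_registers(fixed_in_sets, variables_of, idle_registers):
    pool = list(idle_registers)
    assignment = {}
    for node in sorted(fixed_in_sets):
        for element in sorted(fixed_in_sets[node]):   # elements mapped to 1
            for var in variables_of.get(element, ()):
                if var not in assignment and pool:
                    assignment[var] = pool.pop(0)     # take an idle register
    return assignment

# Hypothetical usage: tensor variable x contained at node V3, z at node V5
# allocate_registers({"V6_IN": {"V3", "V5"}},
#                    {"V3": ["x"], "V5": ["z"]},
#                    ["r0", "r1"])  ->  {"x": "r0", "z": "r1"}
```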
Corresponding to the foregoing embodiment of the optimization method for compiling a computation graph, the disclosure further provides an embodiment of an optimization apparatus for compiling a computation graph. Through the above optimization method for compiling a computation graph, a work stack for the intermediate representation is used to improve the utilization efficiency of compiling memory, reduce the amount of computer memory resources required for running a neural network model, save the register resources of CPU cores that need to be allocated when the neural network model is running on the computer, and finally improve the data training efficiency and the data input and output efficiency of the neural network model.
Referring to the accompanying drawings, the embodiment of the optimization apparatus for compiling a computation graph of the disclosure can be applied to any device with data processing capability, which can be a device or apparatus such as a computer. The apparatus embodiment may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, an apparatus in a logical sense is formed by a processor of any device with data processing capability where the apparatus is located reading corresponding computer program instructions from a nonvolatile memory into an internal memory. In terms of hardware, the hardware structure of any device with data processing capability where the apparatus is located is as shown in the accompanying drawings.
For details of the implementation process of the functions and effects of the units in the above apparatus, reference is made to the implementation processes of the corresponding steps in the above method, and repeated descriptions are omitted here.
For the apparatus embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for related parts. The apparatus embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the disclosure. Those of ordinary skill in the art can understand and implement it without creative effort.
An embodiment of the disclosure further provides a computer-readable storage medium, which stores a program, wherein the program, when executed by a processor, implements the optimization method for compiling a computation graph in the above embodiment. When the program on the computer-readable storage medium is executed by a processor, through the above optimization method for compiling a computation graph, a work stack for the intermediate representation is used to improve the utilization efficiency of compiling memory, reduce the amount of computer memory resources required for running a neural network model, save the register resources of CPU cores that need to be allocated when the neural network model is running on the computer, and finally improve the data training efficiency and the data input and output efficiency of the neural network model.
The computer-readable storage medium may be an internal storage unit of any device with the data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium can also be an external storage device of any device with the data processing capability, such as a plug-in hard disk, a smart media card (SMC), an SD card, and a flash card. Further, the computer-readable storage medium may also include both an internal storage unit of any device with the data processing capability and an external storage device. The computer-readable storage medium is used for storing the computer program and other programs and data required by any device with the data processing capability, and can also be used for temporarily storing data that has been output or will be output.
The above descriptions are only preferred embodiments of the disclosure, and are not intended to limit the disclosure. For those skilled in the art, the disclosure can have various changes and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the disclosure shall all fall within the protection scope of the disclosure.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202211177796.9 | Sep 2022 | CN | national |
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2022/124001 | Oct 2022 | US |
| Child | 17992814 | | US |