Recently, electronic devices such as personal computers, notebooks and cell phones generally have two or more processors for executing different tasks. However, when a complex model such as an artificial intelligence (AI) model is running and accurate and fast results are required, how to use the processors to execute the model becomes a problem. For example, if the electronic device comprises a central processing unit (CPU), a graphics processing unit (GPU) and a vision processing unit (VPU), and only the CPU is arranged to execute the whole model, the GPU and the VPU may not be fully utilized, the CPU may be overloaded, and the processing time may be too long. In another case, most of the model may be executed by one processor such as the CPU, and the other processors are only used to execute the operations that are not supported by the CPU; however, the processors may then have much idle time, and the processors need to synchronize the intermediate results.
It is therefore an objective of the present invention to provide a runtime hyper-heterogeneous process optimization method, to solve the above-mentioned problems.
According to one embodiment of the present invention, an electronic device comprising a plurality of processing circuits is disclosed, wherein the electronic device comprises circuitry configured to perform the steps of: receiving a model and input data for execution; analyzing the model to obtain a graph partition size of the model; partitioning the model into a plurality of graphs based on the graph partition size, wherein each of the graphs comprises a portion of operations of the model; deploying the plurality of graphs to at least two of the processing circuits, respectively; and generating output data according to results of the at least two of the processing circuits executing the plurality of graphs.
According to another embodiment of the present invention, a machine-readable storage medium comprising program codes is disclosed, wherein when the program codes are executed by a processor, the processor performs the steps of: receiving a model and input data for execution; analyzing the model to obtain a graph partition size of the model; partitioning the model into a plurality of graphs based on the graph partition size, wherein each of the graphs comprises a portion of operations of the model; deploying the plurality of graphs to at least two of a plurality of processing circuits, respectively; and generating output data according to results of the at least two of the processing circuits executing the plurality of graphs.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
Certain terms are used throughout the following description and claims to refer to particular system components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . ”. The terms “couple” and “couples” are intended to mean either an indirect or a direct electrical connection. Thus, if a first device couples to a second device, that connection may be through a direct electrical connection, or through an indirect electrical connection via other devices and connections.
The platform 120 has a hyper-heterogeneous optimization (H2O) engine 122, wherein the H2O engine 122 is used to analyze the model 110 and/or train the model 110, so that the model 110 can be partitioned into at least one computational graph to be executed by at least one processing circuit. Specifically,
In Step 206, the H2O engine 122 estimates the model 110 to predict the inference time and memory usage if the model 110 is executed by one or more processing circuits. For example, the H2O engine 122 can use a gradient boosting model such as an LGBM model to estimate the model 110 to obtain the estimated memory usage and predicted execution time (Step 208). The LGBM model is an artificial intelligence algorithm that can be used for all kinds of modeling and prediction. Specifically, for an operation run by a processing circuit, the LGBM model can sequentially establish many trees, wherein a first tree outputs a first result based on input data, a second tree leverages the first tree and outputs a second result based on the input data, a third tree leverages the second tree and outputs a third result based on the input data, a fourth tree leverages the third tree and outputs a fourth result based on the input data, and so on. Then, the performance prediction of the operation run by the processing circuit can be obtained by adding the results of the trees. Because the LGBM model is an open-source distributed gradient boosting framework, and a person skilled in the art should understand the operations of the LGBM model, further descriptions of the LGBM model are omitted here.
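Purely as an illustration, the following minimal sketch shows how a gradient boosting regressor such as LightGBM might be trained to predict per-operation execution time; the feature names, training data and hyper-parameters below are assumptions for the sketch and are not part of the disclosed H2O engine.

```python
# Minimal sketch (illustrative only): predicting per-operation execution
# time with LightGBM. The features and measurements are hypothetical.
import numpy as np
import lightgbm as lgb

# Hypothetical per-operation features: [input size, output size,
# parameter count, arithmetic intensity], profiled offline.
X_train = np.array([
    [224 * 224 * 3, 112 * 112 * 32, 864,    2.1],
    [112 * 112 * 32, 56 * 56 * 64, 18432,   3.4],
    [56 * 56 * 64,  28 * 28 * 128, 73728,   4.0],
    [28 * 28 * 128, 14 * 14 * 256, 294912,  4.5],
])
y_time_ms = np.array([1.8, 2.6, 3.1, 3.7])   # measured execution times (ms)

# Each new tree is fit against the residual of the previous trees, so the
# final prediction is the sum of the per-tree outputs, as described above.
predictor = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1,
                              min_child_samples=1)
predictor.fit(X_train, y_time_ms)

# Predict the execution time of an unseen operation on this circuit.
new_op = np.array([[14 * 14 * 256, 7 * 7 * 512, 1179648, 5.0]])
print("predicted time (ms):", predictor.predict(new_op)[0])
```

A separate regressor could be trained per processing circuit and per target (execution time or memory usage); the tiny data set here only serves to make the sketch self-contained.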
In Step 210, the H2O engine 122 determines whether the bandwidth is sufficient for using two or more processing circuits to execute the model 110. If yes, the flow enters Step 212; if not, the flow enters Step 216.
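The disclosure does not specify how this bandwidth check is performed; purely as an assumption, the sketch below compares the estimated intermediate-data traffic between two candidate processing circuits against the available link bandwidth.

```python
def bandwidth_sufficient(intermediate_bytes_per_inference: int,
                         target_inferences_per_sec: float,
                         available_bandwidth_bytes_per_sec: float,
                         margin: float = 0.8) -> bool:
    """Hypothetical check: the data exchanged between the two graphs per
    second must fit within a fraction (margin) of the available bandwidth."""
    required = intermediate_bytes_per_inference * target_inferences_per_sec
    return required <= margin * available_bandwidth_bytes_per_sec

# Example: 2 MB of intermediate tensors per inference at 30 inferences/s
# over an assumed 10 GB/s link between two processing circuits.
print(bandwidth_sufficient(2 * 1024**2, 30.0, 10 * 1024**3))
```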
In Step 212, the H2O engine 122 determines the graph partition size for two or more processing circuits; that is, the H2O engine 122 can determine the workloads or the number of operations assigned to each of the two or more processing circuits. In Step 214, the H2O engine 122 generates the graphs for the two or more processing circuits. For example, if the CPU 132 and the GPU 134 are determined to run the model 110, the H2O engine 122 may partition the model 110 into a first graph and a second graph, wherein the CPU 132 will run the operations of the first graph, while the GPU 134 will run the operations of the second graph.
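A minimal sketch of such a partition is given below, assuming (only for illustration) that the model is a linear sequence of operations and that the graph partition size is expressed as the number of operations assigned to the first processing circuit.

```python
def partition_model(operations, partition_size):
    """Split a linear sequence of operations into two graphs.

    operations     -- ordered list of the model's operations
    partition_size -- number of operations deployed to the first circuit
    """
    first_graph = operations[:partition_size]
    second_graph = operations[partition_size:]
    return first_graph, second_graph

ops = [f"op_{i}" for i in range(10)]       # stand-in for the model's operations
cpu_graph, gpu_graph = partition_model(ops, 6)
print("CPU graph:", cpu_graph)             # runs operations 0-5
print("GPU graph:", gpu_graph)             # runs operations 6-9
```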
In Step 216, the H2O engine 122 does not deploy the model 110 to more processing circuits, that is, the model 110 may only be executed by the CPU 132.
In the embodiment shown in
In Step 306, the H2O engine 122 determines if the model 110 has been analyzed and predicted by the H2O engine 122 before. If yes, the flow enters Step 318; if not, the model 110 can be regarded as an unknown model, and the flow enters Step 308.
In Step 308, the H2O engine 122 estimates the model 110 to predict the inference time and memory usage. For example, the H2O engine 122 can use a gradient boosting model such as an LGBM model to estimate the model 110 to obtain the estimated memory usage and predicted execution time (Step 310). Because the LGBM model is an open-source distributed gradient boosting framework, and a person skilled in the art should understand the operations of the LGBM model, further descriptions of the LGBM model are omitted here.
In Step 312, the H2O engine 122 determines whether the bandwidth is sufficient for using two or more processing circuits to execute the model 110. If yes, the flow enters Step 314; if not, the flow goes back to Step 300.
In Step 318, the H2O engine 122 determines if a prediction error of the model 110 is greater than a blacklist threshold. If yes, the flow enters Step 326; if not, the flow enters Step 320. Specifically, because the model 110 has previously been estimated and executed, the H2O engine 122 knows the exact difference between the previously estimated execution time and the previous actual execution time of the model 110, and the H2O engine 122 also knows the exact difference between the previously estimated memory usage and the previous actual memory usage when the model 110 was executed; these differences, alone or in combination, can be regarded as the prediction error of the model 110.
In Step 320, the H2O engine 122 determines if the prediction error of the model 110 is less than a whitelist threshold. If yes, the flow enters Step 322; if not, the flow enters Step 324. In this embodiment, the blacklist threshold is greater than the whitelist threshold.
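To make the two thresholds concrete, the following sketch computes a relative prediction error from the previous run and selects among reuse, tuning, and single-circuit fallback in the manner of Steps 318-326; the threshold values and the use of a relative error are assumptions, not values given by the disclosure.

```python
def decide_action(predicted_time_ms, actual_time_ms,
                  whitelist_threshold=0.05, blacklist_threshold=0.30):
    """Illustrative decision; threshold values are assumed.

    Returns one of:
      'reuse'     -- error below the whitelist threshold (Step 322)
      'tune'      -- error between the two thresholds    (Step 324)
      'fallback'  -- error above the blacklist threshold (Step 326)
    """
    error = abs(actual_time_ms - predicted_time_ms) / actual_time_ms
    if error > blacklist_threshold:
        return "fallback"
    if error < whitelist_threshold:
        return "reuse"
    return "tune"

print(decide_action(predicted_time_ms=96.0, actual_time_ms=100.0))  # 'reuse'
print(decide_action(predicted_time_ms=70.0, actual_time_ms=100.0))  # 'tune'
print(decide_action(predicted_time_ms=40.0, actual_time_ms=100.0))  # 'fallback'
```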
In Step 322, the H2O engine 122 uses the previously determined graph partition size for the two or more processing circuits; that is, if the model 110 previously executed by the processing circuits was partitioned into a first graph with a first size and a second graph with a second size, the H2O engine 122 now also partitions the model 110 into the first graph with the first size and the second graph with the second size.
In Step 324, the H2O engine 122 tunes the graph partition size based on the previously determined graph partition size and the prediction error of the model 110 described in Steps 318 and 320, to generate an updated graph partition size. For example, if the actual execution time of a graph is greater than its previously predicted execution time, the H2O engine 122 may reduce the size of this graph to shorten its execution time.
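As an example only, one simple tuning rule consistent with this step is sketched below: it shrinks the graph whose actual time exceeded its prediction and is the current bottleneck. The one-operation step size and the choice of heuristic are assumptions for the sketch.

```python
def tune_partition_size(partition_size, total_ops,
                        predicted_ms_first, actual_ms_first,
                        predicted_ms_second, actual_ms_second):
    """Assumed heuristic: shrink the graph that ran slower than predicted."""
    if actual_ms_first > predicted_ms_first and actual_ms_first > actual_ms_second:
        partition_size -= 1      # move one operation off the first circuit
    elif actual_ms_second > predicted_ms_second and actual_ms_second > actual_ms_first:
        partition_size += 1      # move one operation off the second circuit
    return max(1, min(total_ops - 1, partition_size))

# The first graph was predicted at 90 ms but took 105 ms and is the
# bottleneck, so one operation is shifted to the second circuit.
print(tune_partition_size(6, 10, 90.0, 105.0, 95.0, 92.0))   # prints 5
```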
In Step 314, the H2O engine 122 determines the graph partition size for the two or more processing circuits; that is, the H2O engine 122 can determine the workloads or the number of operations assigned to each of the two or more processing circuits. In this embodiment, the H2O engine 122 can use the off-line tuned graph partition size, the previously determined graph partition size from Step 322, or the updated graph partition size from Step 324. In Step 316, the H2O engine 122 generates the graphs for the two or more processing circuits. For example, if the CPU 132 and the GPU 134 are determined to run the model 110, the H2O engine 122 may partition the model 110 into the first graph and the second graph, wherein the CPU 132 will run the operations of the first graph, while the GPU 134 will run the operations of the second graph.
In Step 326, the H2O engine 122 does not deploy the model 110 to more processing circuits, that is, the model 110 may only be executed by the CPU 132.
In the embodiment shown in
In addition, every time the model 500 is to be executed, the H2O engine 122 can tune the graph partition size to optimize the workloads of the operations 502_1-516_1 and 502_2-516_2, so that the GPU 134 and the DLA 138 can execute the model 500 more efficiently. For example, the execution times of the GPU 134 and the DLA 138 may be 96.457 milliseconds (ms) and 124.219 ms, respectively, when the model 500 is executed for the first time; 98.383 ms and 116.894 ms, respectively, when the model 500 is executed for the second time; 100.323 ms and 109.009 ms, respectively, when the model 500 is executed for the third time; and 101.572 ms and 101.955 ms, respectively, when the model 500 is executed for the fourth time. As the execution times of the GPU 134 and the DLA 138 get closer and closer, the overall execution time of the system becomes shorter.
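This convergence can be illustrated with a simple, assumed proportional rebalancing of the workload split between the two circuits; the relative throughputs and the update rule below are hypothetical and only show how the two per-run times approach each other.

```python
# Illustrative only: rebalance the workload split in proportion to the
# measured per-circuit times, so the two execution times converge.
gpu_share, dla_share = 0.5, 0.5            # fraction of the model's work
gpu_speed, dla_speed = 2.2, 1.8            # assumed relative throughputs

for run in range(1, 5):
    gpu_time = gpu_share / gpu_speed * 400.0   # hypothetical time model (ms)
    dla_time = dla_share / dla_speed * 400.0
    print(f"run {run}: GPU {gpu_time:.1f} ms, DLA {dla_time:.1f} ms")
    # Shift work toward the circuit that finished earlier for the next run.
    total = gpu_time + dla_time
    gpu_share += 0.25 * (dla_time - gpu_time) / total * gpu_share
    dla_share = 1.0 - gpu_share
```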
Briefly summarized, in the H2O engine of the present invention, an artificial intelligence algorithm is used to estimate the unknown model to predict the performance of two or more processing circuits and obtain a graph partition size, for generating two or more graphs to be simultaneously executed by the two or more processing circuits, respectively. Furthermore, the artificial intelligence algorithm is also used to tune the graph partition size to optimize the workloads of the two or more processing circuits every time the model is to be executed, so that the execution of the model becomes more and more efficient.
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
This application claims the priority of U.S. Provisional Application No. 63/063,992 (filed on Aug. 11, 2020), which is included herein by reference in its entirety.