Machine learning (ML) is becoming an increasingly important part of the computing landscape. Machine learning is a branch of artificial intelligence (AI) that helps enable a software system to learn to recognize patterns from data without being directly programmed to do so. Neural networks (NN) are a type of ML which utilize a set of linked and layered functions (e.g., nodes, neurons, etc.) which are weighted to evaluate input data. In some NNs, sometimes referred to as convolutional neural networks (CNNs), convolution operations are performed in NN layers based on inputs received and weights rather than the matrix multiplication used in traditional NNs. Layers in CNNs may perform many types of functions, including, but not limited to, convolution, deconvolution, pooling, up-sampling, etc. CNNs are often used in a wide array of applications, typically for recognition and classification, such as image recognition and classification, prediction and recommendation systems, speech and language recognition and translation, etc.
As ML becomes increasingly useful, there is a desire to execute complex ML techniques, such as NNs and CNNs, efficiently in devices with relatively limited compute and memory resources, such as embedded, or other low-power devices. To help efficiently run a given ML model, the ML model may be analyzed and optimized to tailor how the ML model is run to the target hardware resources to be used.
This disclosure relates to a technique for executing machine learning (ML) models. The technique includes receiving an indication to run a ML model, receiving synchronization information for organizing the running of the ML model with other ML models, determining, based on the synchronization information, to delay running the ML model, delaying the running of the ML model, determining, based on the synchronization information, a time to run the ML model; and running the ML model at the time.
Another aspect of the present disclosure relates to a non-transitory program storage device comprising instructions stored thereon to cause one or more processors to receive a set of ML models, simulate running the set of ML models on a target hardware to determine resources required by the ML models of the set of ML models and timing information, determine to delay running one or more ML models of the set of ML models based on the simulation, and generate synchronization information based on the determining.
Another aspect of the present disclosure relates to an electronic device, comprising a memory, and one or more processors operatively coupled to the memory, wherein the one or more processors are configured to execute instructions causing the one or more processors to receive an indication to run a machine learning (ML) model, receive synchronization information for organizing the running of the ML model with other ML models, determine, based on the synchronization information, to delay running the ML model, delay the running of the ML model, determine, based on the synchronization information, a time to run the ML model, and run the ML model at the time.
For a detailed description of various examples, reference will now be made to the accompanying drawings in which:
As ML has become more common and powerful, hardware configured to execute ML models has been introduced. As used herein, an ML model may refer to an implementation of one or more ML algorithms which model a behavior, such as object recognition, behavior of a circuit, behavior of a neuron, etc. In cases where a target hardware for executing ML models is known, the ML models may be optimized for the target hardware configurations to help enhance performance. For example, ML models for object recognition, low-light enhancement, and facial recognition may be optimized to execute on a particular mobile device, such as a smartphone configured with a certain ML processor. As another example, ML models for object recognition, movement prediction, and behavioral prediction may be optimized to execute on specific hardware found in certain partially or fully self-driving automobiles.
In this example, first layer 106 represents a function based on a set of weights that are applied to the input parameters (e.g., input parameters 102 and 104) to generate output from first layer 106 that is input to the second layer 108. Different weights may be applied for the input received from each node of the previous layer by the subsequent layer. For example, for a node of the second layer 108, the node applies weights to input received from nodes of the first layer 106 and the node may apply a different weight to input received from each node of the first layer 106. Nodes compute one or more functions based on the inputs received and corresponding weights and output a number. For example, the node may use a linear combination function which multiplies the input value from each node of the previous layer with a corresponding weight and sums across the results of the multiplications, coupled with a non-linear activation function which acts as a floor for the resulting number for output. It may be understood that any known weighted function may be applied by the node within the scope of this disclosure. This output number may be input to subsequent layers, or if the layer is a final layer, such as third layer 110 in this example, the number may be output as a result (e.g., output parameters or ML model outputs 112).
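For illustration only, and not as part of the disclosure, the weighted linear combination and floor-style activation described above might be sketched as follows, where the function name and example values are assumptions:

```python
def node_output(inputs, weights, bias=0.0):
    """Sketch of a single node: a linear combination of the inputs from the
    previous layer, each multiplied by its corresponding weight, followed by
    a non-linear activation (here ReLU, which floors the result at zero)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, weighted_sum)

# A node of the second layer combining the outputs of two first-layer nodes
print(node_output([0.5, -1.2], [0.8, 0.3]))  # 0.04
```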
In some cases, the functions applied by nodes of a layer may differ as between layers. In some cases, each layer may have different resource requirements. For example, different functions may have different loads on the processor. Additionally, some functions may have different input or output parameters and thus consume more, or less, memory space and bandwidth. These differing processor and memory loads may also influence an amount of energy to power the processor and memory, as well as an amount of heat generated.
After a ML model, such as neural network ML model 100, is defined with respect to nodes, layers, etc., the ML model may be trained. In some cases, the ML model 100 may be trained using a labelled data set corresponding to data to be input to ML model 100. For example, an object recognizer may be trained on images of objects. These images may include metadata labelling the object(s) in the image. The ML model 100 may be initiated with initial weights and the images input to the ML model 100 to generate predictions. The weights of the nodes may be adjusted based on how accurate the prediction is as compared to the labels. The weights applied by a node may be adjusted during training based on a loss function, which is a function that describes how accurate the predictions of the neural network are as compared to the expected results; an optimization algorithm, which helps determine weight setting adjustments based on the loss function; and/or a backpropagation of error algorithm, which applies the weight adjustments back through the layers of the neural network. Any optimization algorithm (e.g., gradient descent, mini-batch gradient descent, stochastic gradient descent, adaptive optimizers, momentum, etc.), loss function (e.g., mean-squared error, cross-entropy, maximum likelihood, etc.), and backpropagation of error algorithm (e.g., static or recurrent backpropagation) may be used within the scope of this disclosure.
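As a minimal sketch of the training loop described above, assuming a one-weight toy model, a mean-squared-error loss, and plain gradient descent (none of which are mandated by the disclosure):

```python
def train_step(weight, x, label, learning_rate=0.1):
    """One toy training step: forward pass, mean-squared-error loss, and a
    gradient-descent update; backpropagation reduces to a single derivative
    for this one-weight model."""
    prediction = weight * x                  # forward pass
    loss = (prediction - label) ** 2         # loss vs. the labelled value
    gradient = 2 * (prediction - label) * x  # dLoss/dWeight
    return weight - learning_rate * gradient, loss

weight = 0.0
for x, label in [(1.0, 2.0), (2.0, 4.0)] * 50:  # toy labelled data set
    weight, loss = train_step(weight, x, label)
print(round(weight, 3))  # converges toward 2.0
```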
In some cases, training the ML model 100 is performed during development of the ML model 100 and may be performed by a system or device separate from the system or device that runs the trained ML model.
The CPU cores 202 may be coupled to a crossbar (e.g., interconnect) 206, which interconnects and routes data between various components of the device. In some cases, the crossbar 206 may be a memory controller or any other circuit that can provide an interconnect between peripherals. Peripherals may include master peripherals (e.g., components that access memory, such as various processors, processor packages, direct memory access/input output components, etc.) and slave peripherals (e.g., memory components, such as double data rate random access memory, other types of random access memory, direct memory access/input output components, etc.). In this example, the crossbar 206 couples the CPU cores 202 with other peripherals, such as a ML accelerator 208 and other processing cores 210, such as a graphics processing unit, radio basebands, coprocessors, microcontrollers, etc., one or more shared memories 212, as well as external memory 214, such as double data rate (DDR) memory, dynamic random access memory (DRAM), flash memory, etc., which may be on a separate chip from the SoC. The shared memory 212 may include any type of memory, such as static random access memory (SRAM), flash memory, etc. The ML accelerator 208 may include one or more ML cores 216. The ML cores 216 may be processor cores configured to accelerate machine learning models. A runtime controller 218 for controlling ML model execution and interfacing between the ML model and the ML cores 216 may execute on the ML cores. The runtime controller 218 may be software based, for example, an operating system, kernel, and/or hypervisor. In some cases, the runtime controller 218 may include hardware configured to control and/or manage execution of ML models on one or more ML cores 216.
ML Model Preparation
Once a ML model 302 is trained, the ML model 302 may be compiled and translated for a target hardware by a ML model compiler 304A, 304B, . . . 304n (collectively 304). In this example, the target hardware 306 is shown as a simplified version of the device shown in
After compilation of the ML model 302 to runtime code 316 for the target hardware 306, the parameters of the ML model 302 may be stored, for example, in the external memory 314. When a ML model 302 is executed, the runtime code and parameters 316 may be loaded, for example into shared memory 312 or other memory, such as a memory dedicated to a specific ML core 310, and executed by the ML core 310. In some cases, a particular ML model 302 may be executed by a ML core 310 of the ML cores 310. Multiple ML models may be executed concurrently across the multiple ML cores. In some cases, certain ML models may be designated to execute on certain cores of the target hardware. For example, a ML model which uses more processing resources may be assigned to execute on a certain ML core which may have an increased amount of processing power, or multiple ML models which use less processing resources may be designated to execute together on another ML core.
Running multiple ML models concurrently and asynchronously may be associated with a linear scaling of hardware resources needed to run the ML models as ML models are added. For example, each ML model may be associated with a certain memory throughput requirement to load parameters needed by the ML model from memory. As the number of ML models increases, the amount of memory throughput increases linearly with the requirements of the ML models. In accordance with aspects of the present disclosure, optimization techniques may help reduce an amount of additional hardware resources needed to execute multiple ML models concurrently.
In some cases, which ML models will be executed concurrently may be known in advance. For example, a predetermined set of ML models may be run concurrently for use by a camera or another predetermined set of ML models may be run concurrently for an autonomous vehicle. Additionally, hardware resources required by a ML model may vary depending on a portion of the ML model being executed at a particular moment. For example, a ML model such as a deep learning or neural network model may include a number of layers. The hardware resources, for example processor time, memory capacity and throughput, power, etc., required for each layer may be different. In some cases, the execution of the multiple ML models executing on two or more logical computing cores may be sequenced across the cores to balance the hardware resources required by the ML models.
ML Model Execution Optimization
Resource requirements of the ML models 504 may be balanced, for example, by adjusting an execution order of the ML models and/or an amount of time to delay execution of one or more ML models or portions of one or more ML models. For example, where ML model 2 504B consumes a relatively large amount of resources in a number of initial layers and then consumes relatively fewer resources after the initial layers, execution of ML model 1 504A may be scheduled after ML model 2 504B has started so that a high resource consumption period of ML model 1 504A does not coincide with the high resource consumption period of ML model 2 504B.
Additionally, a timing skew may be determined for the ML models 504. The timing skew may indicate an amount of time to delay for starting one or more ML models or an amount of time to delay execution of layers in one or more ML models. In some cases, these delays may be placed before/after/between runs of ML models or between layers of a ML model. In timeline 500, an amount of the timing skew may be represented by a size of the synchronization point. Continuing the example above, a first coarse sync point 506A for ML model 2 504B may represent a minimal or no delay where ML model 2 504B may be initiated at time 0. At a certain time indicated by the width of a second coarse sync point 506B, ML model N 504N may be initiated at time CN. Then at a certain time as indicated by the width of a third coarse sync point 506C, ML model 1 504A may be initiated at time CN+1. In some cases, multiple types of synchronization points 506, 508 may be identified. For example, coarse synchronization points 506 may be identified at the beginning (or end) of the ML models and fine synchronization points 508 may be identified as between layers of the ML models 504. In some cases, the execution order and timing skew for coarse synchronization points 506 and a timing skew for fine synchronization points 508 may be determined, for example when preparing the ML model for the target hardware. In some cases, execution order and timing skew may also be adjusted at runtime.
Multiple ML model execution may be dynamically coordinated based on context information and synchronization pattern information. In some cases, this coordination may be performed at a ML model level. For example, times at which certain ML models may be started on certain cores (e.g., for coarse synchronization points) may be provided as a part of the context and synchronization pattern information for dynamic coordination. Synchronization timing as between layers of the ML model may also be provided as pre-determined, fixed delays to be inserted between layers of the ML model. For example, a number and length of the delays as between layers of the ML model may be determined during the compilation and translation phase prior to execution and stored in a separate portion of the context information. This information may be accessed when executing the ML model, but not used to dynamically coordinate ML model synchronization. During execution of a ML model with inserted delays, when execution of the ML model reaches a fine synchronization point between layers, execution of the ML model may be delayed for the amount of time indicated in the inserted delay.
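For illustration, the pre-determined, fixed delays described above might be consumed between layers roughly as in the following sketch; the callables and the delay table are assumptions rather than the disclosed implementation:

```python
import time

def run_model_with_fixed_delays(layers, inter_layer_delays, model_input):
    """Run a model's layers in order, pausing at each fine synchronization
    point for the pre-determined delay computed before execution.
    `layers` is a list of callables; `inter_layer_delays[i]` is the delay,
    in seconds, inserted after layer i."""
    output = model_input
    for i, layer in enumerate(layers):
        output = layer(output)
        delay = inter_layer_delays.get(i, 0.0)
        if delay > 0.0:
            time.sleep(delay)  # fine synchronization point: fixed delay
    return output
```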
As shown in table 600, for dynamic coordination, a core number and ML model number may be associated with a time, indicating at which time a ML model should be run on which core. For example, a runtime controller may determine that a coarse synchronization point at the beginning of the ML models has been reached. In some cases, the runtime controller may determine when to start a ML model based on the context information and a current context index. Here, time 0 may represent an initial time value, such as a clock counter value. In some cases, a current context index may be incremented after each time is reached. The current context index can help track where execution is in the context information and may be used to help determine a time value associated with the context information. The current context index may be initialized to point to the first entry of the context information corresponding to time 0. At time 0, execution of the first ML model may start. In this example, at time 0, ML model 2 504B may be the first ML model, of the ML models, started on core 2 502B. The current context index may be updated to point to time CN. A comparison of the current time and time CN may be performed and if the current time is less than CN, then starting the next ML model is delayed. At time CN, ML model 1 504A may be started on core 1 502A and the current context index may be updated to point to time CN+1. A comparison of the current time and time CN+1 may be performed and if the current time is less than CN+1, then starting the next ML model is delayed. At time CN+1, ML model N 504N may be started on core N 502N and the current context index may be updated to point to CN+2.
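A highly simplified sketch of the coarse coordination just described follows; the tuple layout, helper names, and times are assumptions made for illustration:

```python
import time

# Each entry of the context information: (core number, ML model number,
# scheduled start time in seconds relative to time 0); values are hypothetical.
context_info = [(2, 2, 0.0), (1, 1, 1.5), (3, 3, 3.0)]

def start_models(context_info, start_model):
    """Walk the context information in order, delaying the start of each ML
    model until its scheduled time is reached (coarse synchronization)."""
    time_zero = time.monotonic()                 # initial time value (time 0)
    for core, model, scheduled in context_info:  # current context index advances
        now = time.monotonic() - time_zero
        if now < scheduled:
            time.sleep(scheduled - now)          # delay starting the next ML model
        start_model(model, core)                 # run the ML model on its core
```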
In some cases, the synchronization pattern information may also include synchronization information at a ML model layer level for coordinating ML models. Table 650 illustrates a variation of table 600 including layer coordination timings. Table 650 may be used in a manner similar to table 600 to help the runtime controller dynamically adjust timings for both coarse and fine synchronization points. Initially, at time 0, execution of the first ML model, here ML model 1 504A, may be started at layer 1 on a first core 502A. As execution continues, a runtime controller may determine that a fine synchronization point located between certain layers of a ML model has been reached. Where the current context index is incremented after each time is reached, the current context index, in this example, would point to time CN and time CN is the next time value after the initial time. This next time value may be compared to a current time value. If the current time is less than the next time value, then execution of the ML model layer associated with the next time value, here layer 2 of ML model 2 504B, may be delayed at the reached fine synchronization point until the current time matches CN. Execution of ML model 2 504B on core 2 502B may then proceed when the current time matches or exceeds CN. The current context index may be incremented to point to CN+1 and when execution of ML model N 504N reaches a next fine synchronization point at layer 5, the current time value is compared to time CN+1 to determine whether to delay execution of layer 5 of ML model N 504N on core N 502N, or continue execution in the same manner as discussed above with respect to time CN. In some cases, synchronization at a beginning of a ML model (e.g., coarse synchronization) may be indicated by a layer number set to the first layer of the ML model, such as shown at time CN+3. Synchronization of ML models may continue in such a manner until T number of entries in the context information is reached. In some cases, after T number of entries are reached, the current context index is moved to the first entry of the context information. A new initial time value may be determined, and synchronization of the ML models based on the context information may be repeated.
In some cases, executing a ML model on simulated target hardware and based on simulated inputs may not exactly match execution of the ML model on the target hardware with real-world inputs. In some cases, a ML model may execute in less time than expected. In such cases, execution of the ML model may be delayed as discussed above. In other cases, execution of the ML model may take longer than expected and adding additional delays in such case may be skipped. In cases where execution of the ML model takes much longer than expected, such as if, for example, execution of ML model 2 504B reaches the synchronization point associated with time CN+2 after time CN+2 has elapsed, additional flexibility in the synchronization of the ML models may be provided. In accordance with aspects of the present disclosure, a leaky bucket scheme may be used to provide a more flexible synchronization scheme to help optimize performance.
If, at step 704, the core number of the current core and a ML model number of the ML model currently executing on the current core match the core and ML model numbers read from the context information, then the current time is obtained and a difference between the time read from a current context index of the context information (Cx) and the current time is determined at step 708. At step 710, if the time read from the context information (Cx) is greater than the current time, then at step 712, a delay equal to the difference between the time read from the context information (Cx) and the current time is set, the next context index is read in preparation for advancing the current context index, and the corresponding ML model is started or continued on the core after the delay. If the time read from the context information (Cx) is less than the current time, execution of the ML model on the core is occurring slower than expected. To help expedite execution, at step 714, the next entry for the core and ML model numbers is found and the corresponding time is updated based on the difference between the time read from the context information (Cx) and the current time. Execution of the corresponding ML model is started, or continued, on the core, skipping the delay.
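One plausible reading of steps 708 through 714, sketched under assumed data structures (a simplification for illustration, not the disclosed flow):

```python
def handle_sync_point(context_times, index, now):
    """Decide what to do at one synchronization point. `context_times[index]`
    is the time read from the context information (Cx) and `now` is the
    current time. Returns the delay to apply before continuing; if execution
    is running late, the delay is skipped and the slack is pushed into the
    next entry (leaky bucket behavior)."""
    cx = context_times[index]
    if cx > now:
        return cx - now                   # running early: delay by the difference
    slack = now - cx                      # running late: skip the delay
    if index + 1 < len(context_times):
        context_times[index + 1] += slack # update the next corresponding time
    return 0.0
```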
As described above, a ML model runtime controller of a core may update the context information based on when execution of a ML model reaches a synchronization point as compared to when the ML model was expected to reach the synchronization point. To help avoid conflict issues where multiple cores attempt to update the context information at once, a core may lock write access to the context information when the core is attempting to update the context information.
Similarly, core N 806B may be executing a ML model and hit a synchronization point 818 between layers x and x+1. Core N 806B may read the context information to determine whether the current context index is associated with ML model x and core N 806B. If the current context information is associated with ML model x and core N 806B, the core N 806B may request a write lock 820 on the context information. After the core N 806B receives an indication that the write lock 820 was placed successfully, the core N 806B may update the context information based on the results of the leaky bucket optimization scheme and unlock 822 the context information for writing.
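A minimal sketch of the lock, update, and unlock sequence, assuming a shared lock object (the threading primitive below merely stands in for whatever locking mechanism the target hardware provides):

```python
import threading

context_lock = threading.Lock()
context_info = {"times": [0.0, 1.5, 3.0], "index": 0}  # hypothetical contents

def update_context_at_sync_point(slack):
    """Acquire write access to the shared context information, apply the
    leaky bucket adjustment, then release the lock so other cores may write."""
    with context_lock:                       # request the write lock
        next_index = context_info["index"] + 1
        if next_index < len(context_info["times"]):
            context_info["times"][next_index] += slack
        context_info["index"] = next_index
    # lock released here: context information unlocked for writing
```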
In some cases, the context information and synchronization pattern information may be determined as a part of preparing the ML models to execute on target hardware.
To help the multi-core orchestrator 906 determine where to insert delays and how long delays should be, the multi-core orchestrator 906 simulates the ML models 902 to characterize the ML models 902. In some cases, the ML models may each be characterized on a layer by layer basis to determine, for each layer, an amount of time needed to execute the layer, an amount of external memory bandwidth needed to execute the layer, an amount of power needed to execute the layer, an amount of internal memory needed to execute the layer, and an amount of internal memory bandwidth needed to execute the layer.
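For illustration, the per-layer characterization produced by the simulation might be captured in a record such as the following sketch; the field names and units are assumptions:

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    """Simulated resource requirements of one layer of an ML model."""
    execution_time_ms: float       # time needed to execute the layer
    ext_mem_bandwidth_gbps: float  # external memory bandwidth needed
    power_mw: float                # power needed
    internal_mem_kb: float         # internal memory needed
    internal_mem_bw_gbps: float    # internal memory bandwidth needed
```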
In some cases, multi-core orchestrator 906 may determine where to insert delays and how long delays should be for the synchronization pattern information 912 and context information 910 based on one or more cost functions. After characterizing the ML models, the multi-core orchestrator 906 may then insert a set of one or more delays in between certain layers and/or delay the start times of certain ML models and evaluate the inserted delays based on one or more cost functions.
In some cases, a first cost function may apply a certain weight to each of the constraints. For example, a first weight may be applied to the amount of external memory bandwidth needed, a second weight applied to the amount of power needed, a third weight applied to the amount of internal memory needed, and a fourth weight applied to the amount of internal memory bandwidth needed. A second cost function may be a sum of all of the delays being applied across all of the cores. This second cost function may be minimized to help avoid adding delays which slow down execution of the ML models. Other cost functions may also be used, including per-core cost functions, such as a sum of all delays introduced on a per-core basis.
For a set of inserted delays, a value of each cost function may be determined. In some cases, an overall cost value for the set of inserted delays may also be determined. The overall cost value may be determined based on applying weights to each cost function and then summing the weighted cost functions. Another set of one or more inserted delays may be determined and the steps repeated to determine values for each cost function and overall cost value. In some cases, sets of one or more inserted delays may be repeatedly evaluated exhaustively and from this exhaustive set, a set of inserted delays which minimizes the overall cost value may be selected and used to generate the context information 910 and synchronization pattern information 912. The context information 910 and synchronization pattern information 912 may be stored on a computer-readable medium for use with the ML models.
After generation, the context information 910 and synchronization pattern information 912 may be stored in an external memory 916 of the target hardware 908. In some cases, the context information 910 and synchronization pattern information 912 may be stored as a part of, or with, other optimization runtime code 918. The external memory 916 is coupled to a SoC 920 and at runtime of the ML models, the context information 910 and synchronization pattern information 912 may be loaded from the external memory 916 into a shared memory 914 of the SoC 920 of the target hardware 908. In some cases, one or more portions of the context information 910 and synchronization pattern information 912 may also be stored or accessed from an external memory 916 of the target hardware 908 during runtime. Runtime code 928 for the ML models on the target hardware 908 may also be stored in the external memory 916. During runtime of the ML models, the execution of the ML models may be controlled by runtime controllers 924 on the cores 922 of the SoC 920. A timing manager 926 of the runtime controller 924 may be provided to detect synchronization points, start, pause, and resume execution of ML models, and/or perform leaky bucket optimization determinations.
At block 1004, a first cost value based on a first cost function may be determined based on the information related to the execution of the set of ML models. In some cases, this cost function may be based on the information related to the execution of the ML models. In some cases, the cost function may be a weighted sum of the information related to the execution of the ML model. For example, the ML models may be simulated with a certain set of delay timings applied to the ML models and a set of information determined, such as maximum amount of external memory bandwidth (B), power consumed (P), internal memory throughput (T), and amount of internal memory (S) used. Each type of information may be weighted (W), for example, by multiplying by a weight, and then summed to obtain the first cost value. In some cases, each type of information may have a different weight. Thus, a first cost function to obtain a first cost value (C1) may be, in this example, C1=W1*B+W2*P+W3*T+W4*S. In some cases, the weights may be constrained. For example, the sum of the weights may be equal to 1 (e.g., W1+W2+W3+W4=1).
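A direct transcription of this first cost function into code might look like the following sketch; the weight values are placeholders:

```python
def first_cost(B, P, T, S, weights=(0.4, 0.3, 0.2, 0.1)):
    """C1 = W1*B + W2*P + W3*T + W4*S, with the weights constrained to sum to 1.
    B: maximum external memory bandwidth, P: power consumed,
    T: internal memory throughput, S: amount of internal memory used."""
    W1, W2, W3, W4 = weights
    assert abs((W1 + W2 + W3 + W4) - 1.0) < 1e-9  # constraint on the weights
    return W1 * B + W2 * P + W3 * T + W4 * S
```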
At block 1006, a second cost value based on a second cost function may be determined. In some cases, this second cost function may be based on a sum of the applied delays. For example, the second cost function may determine the second cost value (C2) by summing an amount of delay time applied to all of the synchronization points of the ML models added across all of the cores. In another example, the second cost function may determine the second cost value (C2) by summing an amount of delay time applied to all of the synchronization points of the ML models that would execute on particular cores of the target hardware (e.g., delay times applied for ML model(s) executing on core 1).
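The second cost function might be sketched as simply as the following, where the per-core grouping is an assumed data layout:

```python
def second_cost(delays_by_core):
    """C2 = sum of the delay times applied at the synchronization points of
    the ML models across all cores; summing a single core's list instead
    gives the per-core variant."""
    return sum(sum(core_delays) for core_delays in delays_by_core.values())

# Example: delays (in milliseconds) applied on each core
print(second_cost({1: [0.0, 2.5], 2: [1.0], 3: [0.5, 0.5]}))  # 4.5
```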
At block 1008, an overall cost value may be determined based on the first cost value and the second cost value. For example, the first cost value and the second cost value may be weighted (Wc) and then summed such that the overall cost for a particular set of delay timings is Wc1*C1+Wc2*C2. At block 1010, overall cost values for additional sets of delay timings may be determined in a loop. In some cases, the orchestrator may simulate the execution of the set of ML models based on a set of parameters, such as a delay range or a maximum delay for coarse and/or fine synchronization points. The orchestrator may perform an exhaustive set of simulations, for example over each delay timing and combination of delay timings within the delay range or below the maximum delay, for the synchronization points of the ML models, and an overall cost value may be determined for each set of delay timings.
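Combining the two cost values and exhaustively looping over candidate sets of delay timings might be sketched as below; the `simulate` callable is a stand-in for whatever measurements the orchestrator actually takes:

```python
from itertools import product

def select_delays(candidate_delays, num_sync_points, simulate, wc1=0.5, wc2=0.5):
    """Exhaustively evaluate sets of delay timings and return the set with the
    minimum overall cost Wc1*C1 + Wc2*C2.
    `candidate_delays`: allowed delay values (e.g., within a delay range),
    `num_sync_points`: number of synchronization points to assign delays to,
    `simulate(delays)`: returns the first and second cost values (C1, C2)."""
    best_delays, best_cost = None, float("inf")
    for delays in product(candidate_delays, repeat=num_sync_points):
        c1, c2 = simulate(delays)
        overall = wc1 * c1 + wc2 * c2        # overall cost value
        if overall < best_cost:
            best_delays, best_cost = delays, overall
    return best_delays, best_cost
```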
After overall cost values are determined for each set of delay timings, at block 1012, a set of delay timings associated with a minimum overall cost value is determined and at block 1014, the determined delay timings associated with the minimum overall cost value may be output. For example, the determined delay timings may be used for the synchronization pattern and/or context information.
In this description, the term “couple” may cover connections, communications, or signal paths that enable a functional relationship consistent with this description. For example, if device A generates a signal to control device B to perform an action: (a) in a first example, device A is coupled to device B by direct connection; or (b) in a second example, device A is coupled to device B through intervening component C if intervening component C does not alter the functional relationship between device A and device B, such that device B is controlled by device A via the control signal generated by device A.
Modifications are possible in the described embodiments, and other embodiments are possible, within the scope of the claims.