This U.S. patent application claims priority under 35 U.S.C. § 119 to: Indian Patent Application No. 202221043520, filed on Jul. 29, 2022. The entire contents of the aforementioned application are incorporated herein by reference.
The disclosure herein generally relates to pruning of deep neural networks (DNNs), and, more particularly, to a method and system for jointly pruning and hardware acceleration of pre-trained deep learning models.
In recent trends, artificial intelligence (AI) and machine learning (ML) techniques are increasingly moving towards embedded system-based applications such as smartphones, consumer electronic devices, and smart vehicles, providing advanced and personalized features. Among ML models, deep neural networks (DNNs) in particular have recently enabled unprecedented levels of intelligence on numerous systems, providing effective applications in a broad spectrum of domains such as computer vision, healthcare, autonomous driving, machine translation, and many others. Though automated feature engineering exists for deep learning models to a large extent, building complex models requires extensive domain knowledge or huge infrastructure for employing techniques such as neural architecture search (NAS).
In many industrial applications, there is a requirement for in-premises decisions close to the sensors, which makes deployment of deep learning models on edge devices a desirable option. Rather than designing application-specific deep learning models from scratch, transformation of already built models can be achieved faster and at reduced cost. In such scenarios, an efficient DL model search approach is required to select from pre-trained deep learning models and further schedule the inference workload on the heterogeneous computing platforms used in edge devices.
In existing approaches, most of the resource-constrained devices used in the Industrial Internet of Things (IIoT), robotics, Industry 4.0, and the like lack DNN models whose features, accuracy, and inference latency on the edge hardware configurations are suitable for business requirements. In addition, porting the relevant DNN models to new hardware requires decision-making skills within a short time. Moreover, optimizing already ported DNN models on resource-constrained ensembles of embedded targets poses several other challenges.
In other existing approaches such as Cyber-Physical Systems (CPS) and edge computing scenarios, the target hardware configurations are less powerful in terms of processors, memory, and battery when compared to smartphones and other enterprise hardware. Moreover, these IoT devices are often made of relatively robust and resilient hardware, with a wide range of connectivity and input/output options. Because of this, it is not always feasible to change the hardware to accommodate a new DNN model for inference on such edge devices. Selecting deployment hardware based on the DNN inference workload, multiple dependencies on many different stakeholders, mandatory testing cycles, and tight schedules make it difficult to completely replace an existing edge hardware setup. In another approach, automated transformation requires a route through the model architecture that is dynamically composed of different network operations, making a series of decisions using reinforcement learning. However, such an approach requires training data to redesign the model.
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one embodiment, a system for jointly pruning and hardware acceleration of pre-trained deep learning models is provided. The processor implemented system is configured by the instructions to receive from a user a pruning request comprising (i) a plurality of deep neural network (DNN) models, (ii) a plurality of hardware accelerators comprising one or more processors, a plurality of target performance indicators comprising a target accuracy, a target inference latency, a target model size, a target network sparsity, and a target energy, and (iii) a plurality of user options comprising a first pruning search and a second pruning search. The plurality of DNN models and the plurality of hardware accelerators are transformed into a plurality of pruned hardware accelerated DNN models based on at least one of the user options. The first pruning search option executes a hardware pruning search technique to perform a search on each DNN model and each processor based on at least one of a performance indicator and an optimal pruning ratio. The second pruning search option executes an optimal pruning search technique to perform a search on each layer with its corresponding pruning ratio. Further, an optimal layer associated with the pruned hardware accelerated DNN model is identified based on the user option. A layer assignment sequence technique creates a static load distributor by partitioning the optimal layer of the DNN model into a plurality of layer sequences and assigning each layer sequence to a corresponding processing element of the hardware accelerators.
In another aspect, a method for jointly pruning and hardware acceleration of pre-trained deep learning models is provided. The method includes receiving from a user a pruning request comprising (i) a plurality of deep neural network (DNN) models, (ii) a plurality of hardware accelerators comprising one or more processors, a plurality of target performance indicators comprising a target accuracy, a target inference latency, a target model size, a target network sparsity, and a target energy, and (iii) a plurality of user options comprising a first pruning search and a second pruning search. The plurality of DNN models and the plurality of hardware accelerators are transformed into a plurality of pruned hardware accelerated DNN models based on at least one of the user options. The first pruning search option executes a hardware pruning search technique to perform a search on each DNN model and each processor based on at least one of a performance indicator and an optimal pruning ratio. The second pruning search option executes an optimal pruning search technique to perform a search on each layer with its corresponding pruning ratio. Further, an optimal layer associated with the pruned hardware accelerated DNN model is identified based on the user option. A layer assignment sequence technique creates a static load distributor by partitioning the optimal layer of the DNN model into a plurality of layer sequences and assigning each layer sequence to a corresponding processing element of the hardware accelerators.
In yet another aspect, a non-transitory computer readable medium is provided for receiving from a user a pruning request comprising (i) a plurality of deep neural network (DNN) models, (ii) a plurality of hardware accelerators comprising one or more processors, a plurality of target performance indicators comprising a target accuracy, a target inference latency, a target model size, a target network sparsity, and a target energy, and (iii) a plurality of user options comprising a first pruning search and a second pruning search. The plurality of DNN models and the plurality of hardware accelerators are transformed into a plurality of pruned hardware accelerated DNN models based on at least one of the user options. The first pruning search option executes a hardware pruning search technique to perform a search on each DNN model and each processor based on at least one of a performance indicator and an optimal pruning ratio. The second pruning search option executes an optimal pruning search technique to perform a search on each layer with its corresponding pruning ratio. Further, an optimal layer associated with the pruned hardware accelerated DNN model is identified based on the user option. A layer assignment sequence technique creates a static load distributor by partitioning the optimal layer of the DNN model into a plurality of layer sequences and assigning each layer sequence to a corresponding processing element of the hardware accelerators.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles:
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Embodiments herein provide a method and system for jointly pruning and hardware acceleration of pre-trained deep learning models. The system may alternatively be referred to as a deep neural network (DNN) model fitment system. The disclosed method enables pruning the layers of a plurality of DNN models using an optimal pruning ratio. The system 100 has two integrated stages,
Referring now to the drawings, and more particularly to
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random-access memory (SRAM) and dynamic-random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure. In other words, input(s) fed at each step and output(s) generated at each step are comprised in the memory 102 and can be utilized in further processing and analysis.
The hardware accelerated DNN model pruner 202 of the system 100 is equipped to switch between the user selected options to process each pruning request received from the user. The user options comprise the first pruning search option and the second pruning search option. The pruning request comprises the plurality of DNN models, a plurality of hardware accelerators, a plurality of target performance indicators, and a plurality of user options. Each pruning request is processed individually to obtain a pruned hardware accelerated DNN model.
In one embodiment, for accelerating each DNN model, the accelerator standard development kits (SDKs) convert each DNN model for the corresponding DNN accelerator hardware. For instance, the following command converts a standard TensorFlow Lite model to the hardware accelerated version for the Coral edge tensor processing unit (TPU): “edgetpu_compiler DNN_model_name.tflite”.
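As an illustrative sketch only (not the exact SDK flow of the present disclosure), the acceleration step may be scripted around the vendor tooling; the model object and file name below are placeholders:

# Sketch: convert a Keras model to TFLite, then invoke the Edge TPU compiler
# mentioned above. Model and file names are illustrative placeholders.
import subprocess
import tensorflow as tf

def accelerate_for_edgetpu(keras_model, tflite_path="DNN_model_name.tflite"):
    # Convert the trained model to the TFLite flatbuffer format.
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training optimization
    with open(tflite_path, "wb") as f:
        f.write(converter.convert())
    # Delegate compilation for the Coral Edge TPU to the vendor SDK tool.
    subprocess.run(["edgetpu_compiler", tflite_path], check=True)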
The optimally transformed DNN model 204 of the system 100 captures and records a plurality of pruned hardware accelerated DNN models for layer splitting. Pruning a DNN model refers to setting certain parameters to zero, which increases the sparsity of the DNN network. In most cases, pruning reduces the inference latency of the DNN network. However, the latency reduction comes at a cost, as it reduces the inference accuracy: a higher pruning ratio yields a lower inference latency but also a lower inference accuracy. Further, there are two different classes of pruning, namely an unstructured pruning and a structured pruning.
Unstructured pruning discards weights throughout each DNN model based on a random rule or a magnitude rule, which causes minimal accuracy loss. Unstructured pruning results in pruned DNN models with sparse weight matrices. However, standard deep learning runtime libraries are designed to work on dense matrices, so the sparsity of such DNN models does not by itself guarantee higher inference acceleration. Structured pruning in a convolutional layer improves the latency, and removal of complete filters results in inference speedup but causes a larger drop in inference accuracy.
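A minimal sketch contrasting the two pruning classes, using PyTorch's pruning utilities as one possible library; the model, the targeted layer, and the pruning ratios are illustrative choices, not parameters of the present disclosure:

# Illustrative contrast between unstructured and structured pruning.
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(weights=None)
conv = model.layer1[0].conv1

# Unstructured pruning: zero out 30% of individual weights by L1 magnitude,
# yielding a sparse weight matrix but leaving the layer shape unchanged.
prune.l1_unstructured(conv, name="weight", amount=0.3)

# Structured pruning: remove 30% of whole filters (dim=0) by L2 norm, which
# maps more directly to latency reduction on dense-matrix runtimes.
prune.ln_structured(conv, name="weight", amount=0.3, n=2, dim=0)

# Fold the accumulated pruning mask into the weights permanently.
prune.remove(conv, "weight")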
The layer splitter and profiler 206 of the system 100 identifies an optimal layer associated with the pruned hardware accelerated DNN model based on the user option. It is noted that when an optimally pruned and accelerated DNN model is generated for a particular processor or processing element, the individual layers are split for workload balancing. Most of the standard deep learning libraries provide a method to split the DNN model graph into individual layers.
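A minimal sketch of layer splitting with per-layer latency profiling, assuming a sequential-style PyTorch model; the helper below is an illustration, not the layer splitter and profiler 206 itself:

# Sketch: split a sequential-style model into layers and time each one.
import time
import torch

def split_and_profile(model, sample_input, repeats=20):
    layers = list(model.children())          # split the graph into individual layers
    profile, x = [], sample_input
    with torch.no_grad():
        for layer in layers:
            start = time.perf_counter()
            for _ in range(repeats):
                y = layer(x)
            latency = (time.perf_counter() - start) / repeats
            profile.append((layer.__class__.__name__, latency))
            x = y                            # the output feeds the next layer
    return layers, profile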
The layer sequencer 208 of the system 100 partitions the optimal layer of each DNN model into a plurality of layer sequences.
The static load distributor 210 of the system 100 distributes each layer sequence for optimal layer mapping. The present disclosure is further explained considering an example, where the system 100 determines at least one pruned hardware accelerated DNN model using the systems of
Referring now to the steps of the method 300, at step 302, the one or more hardware processors 104 receive a pruning request comprising a plurality of deep neural network (DNN) models, a plurality of hardware accelerators comprising one or more processors, a plurality of target performance indicators, and a plurality of user options. The plurality of target performance indicators comprises a target accuracy, a target inference latency, a target model size, a target network sparsity, and a target energy. Each target performance indicator acts as a baseline value. The plurality of user options comprises the first pruning search and the second pruning search.
Considering an example where the DNN model fitment system 100 may receive the pruning request as an input from user(s), the pruning request is processed by the DNN model fitment system 100 by jointly pruning and hardware accelerating. The pruning request is processed based on the selected user option, which outputs the transformed pruned hardware accelerated DNN models. The example ResNet-18 architecture has a layer-wise probability distribution of ResNet-18 pruned variants obtained using the first pruning search option.
Referring now to the steps of the method 300, at step 304, the one or more hardware processors 104 transform the plurality of DNN models and the plurality of hardware accelerators into a plurality of pruned hardware accelerated DNN models based on at least one user option. The user options are referred to as the pruning search techniques, which increase or decrease the pruning level so as to prune each DNN model as much as possible without losing inference accuracy. One such handle is the pruning ratio (Δ), which when increased, increases the pruning level, and vice versa.
The search space defines the pruning ratio Δ ∈ (0, 1), which quantifies the pruning level, and may comprise a coarse-grained pruning search space or a fine-grained pruning search space, depending on the step size between the pruning ratios. The coarse-grained pruning search space is defined with a fixed step size, for example 0.1. Such a fixed step size results in the pruning search space {0.1, 0.2, . . . , 0.9}. In the fine-grained pruning search space, the pruning ratio Δ can take any value between 0 and 1 with a step size as small as 10−3. It is to be noted that experimental evaluations were performed using both the coarse-grained pruning search space and the fine-grained pruning search space.
From each pruning search space, the number of combinations is estimated to obtain an optimal set of pruning ratios for each DNN model. Representing the coarse-grained space as C, there are |C|^L combinations to consider when finding the optimal pruning ratios that achieve maximum sparsity without accuracy loss for an L-layer model. As an example, with |C| ≈ 100 values {0.01, 0.02, 0.03, . . . , 0.98, 0.99}, pruning a dense convolutional network (DenseNet-161) model with 161 layers would require experimenting with 100^161 different combinations. Moreover, each such combination essentially has to be evaluated for the accuracy of the pruned DNN model on the test dataset, which incurs additional cost.
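A short sketch of how the two search spaces and the resulting combinatorial count may be enumerated; the step sizes and layer count are the illustrative values from the text above:

# Sketch: coarse- and fine-grained pruning-ratio search spaces and the size of
# an exhaustive layer-wise search over the coarse space.
import numpy as np

coarse = np.arange(0.1, 1.0, 0.1)      # {0.1, 0.2, ..., 0.9}
fine = np.arange(0.001, 1.0, 0.001)    # step size down to 1e-3

num_layers = 161                        # e.g., DenseNet-161
combinations = len(coarse) ** num_layers
print(f"coarse space size {len(coarse)}, layer-wise combinations ~ {combinations:.3e}")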
It may be noted that the above example illustrates an ideal pruning-ratio search to find the best combination of Δ values for all layers of each DNN model. However, the standard deep learning libraries provide a global pruning strategy that searches for and specifies a single Δ value valid for the whole DNN model. This approach is faster, but the Δ value found through this method is often suboptimal.
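For instance, a global pruning strategy of the kind referenced above is available in PyTorch, where a single ratio is shared by every prunable layer; the model and the ratio of 0.4 below are illustrative choices:

# Sketch: library-provided global pruning with one ratio for the whole model.
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(weights=None)
parameters_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))
]
# One global ratio (here 0.4) is applied across all prunable layers at once.
prune.global_unstructured(
    parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.4
)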
To process the pruning request, the hardware accelerated DNN model pruner 202 of the system 100 identifies the user option for transforming the plurality of DNN models and the plurality of hardware accelerators into a plurality of pruned hardware accelerated DNN models. If the user option is the first pruning search option, then the system triggers the hardware pruning search technique to perform a search on each DNN model and each processor based on at least one of a performance indicator and an optimal pruning ratio. This user option is pertinent for example scenarios where there is limited time for model fitment, such as quick prototyping in a software service industry. The technique performs a heuristic search to identify good sparsity values at a faster rate and without a drop in at least one of the performance indicators or the inference accuracy. It is noted that the method of the present disclosure can be easily integrated into available standard deep learning libraries for finding the most optimal hyperparameters automatically.
In one embodiment, the hardware pruning search technique (Table 1) performs its steps by initializing each DNN model, a maximum pruning ratio with an upper bound value, an initial pruning ratio, a step size update, a global pruning library, at least one accelerating element, a number of processing elements, a maximum resolution, and a plurality of first performance indicators comprising an accuracy, an inference latency, a model size, and an energy of the hardware accelerated DNN models. Further, the method computes at least one first performance indicator value by pruning the corresponding DNN model using the pruning ratio and accelerating it. When a change is observed in at least one first performance indicator value with respect to the corresponding target performance indicator value, a revised pruning ratio is updated. The revised pruning ratio is recorded when at least one first performance indicator value is nearest to the target performance indicator value, and the step size is modified.
The hardware pruning search technique provides a global pruning search that achieves maximum pruning, considering the standard development kit acceleration, while at the same time preserving accuracy. It uses the fine-grained search space for finding the optimal pruning ratio Δ. To find the optimal pruning ratio Δ, the following steps of Table 1 are performed (a code sketch follows the steps):
Step 1: Initialize the pruning ratio Δ and the step size update ∈ (0, 1), and define an upper bound for Δ, MAX_PR, in line 2 of Table 1.
Step 2: Accelerate the model using the corresponding SDK of a participating processing element and find its accuracy Aprune in lines 7-9 of Table 1.
Step 3: If there is an accuracy drop due to the SDK acceleration for any of the accelerated models (lines 9-10 of Table 1), revise the pruning ratio by using a less aggressive update in lines 20-22 of Table 1.
Step 4: If there is no accuracy drop due to pruning and subsequent acceleration, store the pruning ratios as the best and increment by the step size in lines 14-17 of Table 1.
Step 5: However, if there is an accuracy drop for any of the accelerated models, revert the pruning ratios to the last best-known value and decrease the step size by half.
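A minimal sketch of this loop is given below. It assumes hypothetical helpers prune_and_accelerate and evaluate_accuracy that prune the model globally with a given ratio, run the SDK acceleration for a processing element, and return the measured accuracy; the default values mirror the experimental settings described later:

def hardware_pruning_search(model, processors, base_accuracy,
                            max_pr=0.99, pr=0.2, step=0.1, precision=1e-3):
    # Steps 1-5 of the hardware pruning search, in condensed form.
    best_pr = 0.0
    while step >= precision:
        accuracies = [
            evaluate_accuracy(prune_and_accelerate(model, pr, pe)) for pe in processors
        ]
        if min(accuracies) >= base_accuracy:
            best_pr = pr                       # no drop: record and prune more aggressively
            if pr >= max_pr:
                break                          # reached the upper bound of the search space
            pr = min(pr + step, max_pr)
        else:
            step = step / 2                    # drop observed: use a less aggressive update
            pr = min(best_pr + step, max_pr)   # restart from the last best-known ratio
    return best_pr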
It has been experimentally evaluated and is represented in
In another embodiment, to process the pruning request with the second pruning search option, the optimal pruning search technique is triggered to perform a search on each layer with its corresponding pruning ratio. This option suits scenarios where enough time and resources are available to perform the search, for example a full-scale project cycle, and finds near-optimal sparsity values without a drop in the inference accuracy.
Referring now back to the above example, the optimal pruning search technique (Table 2) performs its steps by initializing each DNN model, a maximum population size, a mutation rate with the highest predefined probability threshold value, the layer-wise pruning ratios, a plurality of second performance indicators comprising an accuracy and a network sparsity, the one or more accelerating elements, a lower bound mutation rate, and a fitness score function. Further, a loop is executed iteratively until it reaches the maximum population size to create an individual element for each layer based on the pruning ratio associated with each layer of the DNN model. Here, the pruning ratio for each layer is randomly generated. Further, at least one second performance indicator value of the individual element and the hardware accelerated DNN model is computed, and each individual element with its corresponding pruning ratios is recorded into a population batch.
In one embodiment, an optimal layer-wise search is performed to determine each hardware accelerated DNN model by iteratively selecting the fittest individual element from the population batch using the fitness score function. The fitness score function calculates an individual element's fitness in the population batch depending on a primary metric and a secondary minimum metric. Further, a new individual element is created by randomly selecting layers of each DNN model and randomly changing the pruning ratios associated with each layer based on the mutation rate. Here, the mutation rate linearly decreases at every search step, and the search ends when the mutation rate is equal to the lower bound mutation rate. Further, the method computes at least one second performance indicator value of each new individual element and the hardware accelerated layers of each DNN model. The new individual element is added to the population batch, and the least fit individual element is removed from the population batch.
The optimal pruning search technique uses a stochastic approach to solve complex problems, with simple initial conditions and rules. Such techniques iteratively improve a population of solutions, and often perform better than deterministic approaches. This search technique has been used to prune networks by finding the optimal layer-wise pruning ratio Δ, resulting in higher sparsity and low latency models. However, such models often suffer from an accuracy drop and require re-training to recover the loss in accuracy. The optimal pruning search technique enables finding optimal configurations such that the resultant model has the same accuracy as the base model. This reduces the overall search time and the computational load. The base model can be defined as a set of operations denoted by ℒ = {l1, l2, . . . , lL}, where an operation lk, k ∈ {1, 2, . . . , L}, can either be a convolution or max pooling or even an activation like ReLU or tanh. Each trainable layer has trainable parameters or weights Wk, and pruning the layer is represented in Equation 1,
Wk′ = prune(Wk | Δ) Equation 1
Every trainable layer can be pruned using Equation 1. The resultant pruned model ℒ′ will have an associated accuracy (Aprune) and overall network sparsity (SPprune), represented by the tuple (Aprune, SPprune). The optimal pruning search technique uses these as a set of objectives to search for the optimal combination of layer-wise Δ values for all layers, such that the model has a high SPprune and its Aprune is the same as Abase.
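A brief sketch of applying Equation 1 layer by layer and measuring the resulting network sparsity SPprune, assuming PyTorch pruning utilities; the layer_ratios mapping is an illustrative input, and accuracy evaluation is left to a separate test-set routine:

# Sketch: per-layer pruning (Equation 1) and the resulting network sparsity.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_layerwise(model, layer_ratios):
    # layer_ratios: {module_name: pruning ratio Δ for that layer}
    for name, module in model.named_modules():
        if name in layer_ratios and isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=layer_ratios[name])
    return model

def network_sparsity(model):
    zeros = total = 0
    for module in model.modules():
        if hasattr(module, "weight") and isinstance(module.weight, torch.Tensor):
            w = module.weight.detach()
            zeros += int(torch.sum(w == 0))
            total += w.numel()
    return zeros / total            # SPprune: fraction of zeroed weights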
In another embodiment, the pruning search model solves a multi-objective optimization problem (MOOP). The MOOP is a class of problem which solves for a set of results in a space of solutions instead of a single solution. Given an arbitrary search space X and a set of objective functions F, the objective is to find a subset X′ which satisfies all the objectives given in the problem as well as possible. This subset X′ is called the pareto-optimal set or the pareto-front and is mathematically described below. Considering x as any solution from the set X, the set of objectives is F(x) = min{f0(x), f1(x), . . . , fN(x)}, where N is the total number of objectives and fi (i ∈ 0, 1, . . . , N) represents a single objective. A solution xa is said to dominate xb, as represented in Equation 2 and Equation 3, if,
∀i∈{0,1, . . . ,N}, fi(xa) ≤ fi(xb) Equation 2
∃j∈{0,1, . . . ,N}, fj(xa) < fj(xb) Equation 3
In this case, xa≻xb (read as xa dominates xb). The set X′ consists of solutions that are not dominated by any other solution: no solution in X′ can be improved in one objective without degrading another objective. This set X′ is then defined as the pareto-optimal set of solutions which obtains the optimal values for the given objectives. To implement pareto-optimality, a single scalar can efficiently represent the value from each of the given objectives. The multi-objective cost function is represented in Equation 4,
C = Σa=0N λa ya Equation 4
where ya = {1−Aa, 1−Spa} and N is the total number of objectives. This generates a single scalar which is used to determine the fitness. To optimize the inverse of the accuracy and the sparsity of the network, the error rate and the network density are minimized using Equation 4, where λ is a weight vector initialized by the user.
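A compact sketch of the dominance test of Equations 2-3 and the scalarized cost of Equation 4 used as the fitness score; the weight vector lam is a user-chosen assumption:

# Sketch: Pareto dominance (Equations 2-3) and the scalarized cost (Equation 4).
def dominates(fa, fb):
    # fa, fb: lists of objective values f_i(x_a), f_i(x_b); lower is better.
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))

def cost(accuracy, sparsity, lam=(0.5, 0.5)):
    # y_0 = 1 - A (error rate), y_1 = 1 - Sp (network density), per Equation 4.
    y = (1.0 - accuracy, 1.0 - sparsity)
    return sum(l * v for l, v in zip(lam, y))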
The set of layer-wise pruning ratios Δ for each DNN model is the pruning configuration, and finding the optimal set of Δ values minimizes the objectives. The method of the present disclosure first generates a random population batch of configurations and selects the best-fit instance for further mutation. The optimal pruning search technique uses the cost function (Equation 4) to determine the fitness of each DNN model. Here, selection of the best-fit individual element from the entire population is shown to be faster. The selection process returns one individual configuration called the parent configuration, or simply the parent. The values of such a configuration (like some Δs in its encoding) may undergo a random modification, which is called mutation. This results in a new individual element configuration that can produce a better global sparsity at an accuracy closer to the base model. Typically, mutation happens when a randomly drawn probability value falls below a threshold probability known as the mutation rate. The lower the mutation rate, the lower the chance of mutation for any member of the configuration.
A dynamic mutation rate is used, where the probability threshold value changes linearly throughout the pruning search to maintain diversity in the population. By linearly reducing the mutation rate, the amount of mutation that takes place on the fittest individual also reduces. The optimal pruning search technique, along with the effect of each configuration on the accuracy and the sparsity of the pruned accelerated models for each processing element, is given in Table 2. The optimal pruning search technique defines the bound parameter serving as the upper bound to the model's error rate (1−Abase) and the target density (1−SPtarget), as represented in the below steps.
Step 1: In lines 4 to 10 of Table 2, the population (of size POP_SIZE) is created from randomly generated configurations. Then, each configuration is applied to the model, the model is accelerated on the processing elements or processors, and the accuracy is recorded for each processing element.
Step 2: The mutation rate mut_rate is set to the highest value, 1, in line 1. The probability threshold is in the range [0, 1].
Step 3: With the initial population created, the search is performed from lines 14 to 23 of Table 2. The best-fit individual from the population is selected and mutated.
Step 4: The mutation rate is decreased after every search step in line 25 of Table 2. The mutation rate is lower-bounded by α to allow continued mutation.
Step 5: After the search is completed, the fittest configuration is selected from the population in line 27 of Table 2.
This is the optimal combination of the pruning ratios corresponding to each layer. The base model is pruned with these pruning ratios and accelerated further in lines 28-30 to generate the final set of transformed models for all the processing elements. It may be noted that a new individual entity is evaluated both in the population creation phase and in the mutation phase. This involves pruning the model, applying the corresponding acceleration for all the processing elements, and storing the minimum accuracy (Aprune) and network sparsity (SPprune) across all the processing elements.
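A condensed sketch of the optimal pruning search of Table 2 is given below. It assumes hypothetical helpers evaluate_config (layer-wise pruning plus acceleration on every processing element, returning the minimum accuracy and the network sparsity) and cost (Equation 4, for example the sketch above); the population size, step count, and α are illustrative values:

# Sketch: population-based layer-wise pruning search with a decaying mutation rate.
import copy
import random

def optimal_pruning_search(model, layer_names, processors, steps=100,
                           pop_size=20, alpha=0.05):
    population = []
    for _ in range(pop_size):                              # Step 1: random initial population
        cfg = {name: random.random() for name in layer_names}
        population.append((cfg, evaluate_config(model, cfg, processors)))

    mut_rate = 1.0                                         # Step 2: highest mutation probability
    for step in range(steps):                              # Step 3: search loop
        parent = min(population, key=lambda ind: cost(*ind[1]))[0]
        child = copy.deepcopy(parent)
        for name in layer_names:                           # mutate some layer-wise ratios
            if random.random() < mut_rate:
                child[name] = random.random()
        population.append((child, evaluate_config(model, child, processors)))
        population.remove(max(population, key=lambda ind: cost(*ind[1])))  # drop least fit
        mut_rate = max(alpha, 1.0 - step / steps)          # Step 4: linear decay, lower bound α

    return min(population, key=lambda ind: cost(*ind[1]))[0]  # Step 5: fittest configuration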
Referring now to the steps of the method 300, at step 306, the one or more hardware processors 104 identify an optimal layer associated with the pruned hardware accelerated DNN model based on the user option. In the above example, by executing either of the user options, the optimal layers are identified from the pruned hardware accelerated DNN models.
Referring now to the steps of the method 300, at step 308, the one or more hardware processors 104 create, by using a layer assignment sequence technique, a static load distributor by partitioning the optimal layer of the DNN model into a plurality of layer sequences and assigning each layer sequence to a corresponding processing element of the hardware accelerators. Referring now back to the example, dynamic programming problems are often solved using a tabular approach. Towards that, an attempt is made to create a table similar (
The layer assignment sequence technique (referring to Table 3) obtains from the layer splitter each layer of the DNN model associated with the pruned accelerated DNN model, based on at least one user option. Further, the first column of a layer table is filled with the cumulative execution latency of each layer on the first processor, and the first row of the layer table with the sum of the execution latency and the data transfer latency of the first layer for all the participating processor(s).
Then, a schedule for each processor is obtained in a recursive bottom-up manner, filling up the complete layer table. Further, a schedule array indexed by layer is created to obtain the optimal schedule of each layer, and each array location corresponding to a layer index is assigned a processor number. The layer table is re-traced to obtain the optimal schedule of each layer.
To incrementally fill up the layer table, the following steps are performed (Table 3),
Step 1: Obtain the individual layers and profiling information from the layer splitter.
Step 2: Start filling the first column, corresponding to the cumulative execution latencies of all the layers on the first processor, and the first row, corresponding to the sum of the execution latency and the data transfer latency of the first layer on all the participating processors (
Step 3: Based on Equation (5), obtain the schedule for each processor in the recursive, bottom-up manner, finally filling up the complete layer table.
Step 4: The layer table is re-traced to obtain the optimal schedule of each layer. Specifically, the schedule array indexed by layers is created, and the processor number is assigned to the array location corresponding to the index of the layer. An example of such a schedule output is as follows: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], where the first four layers are assigned to the first processor and the next seven layers are assigned to the second processor.
The layers in each DNN model are essentially sequential. Layer partitioning, with proper distribution of subsets of layers across multiple devices, has an effect on the overall average latency of the inference tasks by minimizing the make-span of each task. This improves overall resource utilization, which is valuable for common vision applications in embedded IoT. In such scenarios, minimizing the latency, which impacts the overall throughput, is essential in the embedded domain. The capability of the underlying processing elements and the bandwidth of the bridge between the host and the accelerators determine the optimal partitioning of the DNN model into sub-sequences of layers. For instance, the bandwidth between the host processor cores is in the order of gigabits per second, whereas the bandwidth between the host and a universal serial bus accelerator can be a thousand times less.
Referring to an example depicting the effects of pipelined parallelism by partitioning the model layers and assigning them to different devices using an image identification application, it has been assumed that each DNN model consists of layers {L1, L2, . . . , L6} which are partitioned into three subsets, namely {{L1, L2, L3}, {L4, L5}, {L6}}. Each subset is assigned to one of the three available devices depending on the device parameters, and the pipelined parallelism leads to a reduction in the inference latency.
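A small worked sketch of the pipelined partition in this example; the per-layer latencies are illustrative numbers, not measured values:

# Sketch: steady-state pipeline interval for the three-subset partition above.
layer_latency = {"L1": 4.0, "L2": 3.0, "L3": 2.0, "L4": 5.0, "L5": 4.0, "L6": 6.0}
partition = [["L1", "L2", "L3"], ["L4", "L5"], ["L6"]]

stage_latency = [sum(layer_latency[l] for l in stage) for stage in partition]
sequential_latency = sum(layer_latency.values())      # single device, no pipelining
pipeline_interval = max(stage_latency)                # steady-state time per inference

print(f"stages: {stage_latency} ms, sequential: {sequential_latency} ms, "
      f"pipelined interval: {pipeline_interval} ms")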
A brute force approach can evaluate all possible sub-sequence combinations on all possible processing elements in the heterogeneous computing platform. A capacity based partitioning approach can be employed by assigning the largest possible sub-sequence of layers to the current processing element, considering one processing element at a time. However, this greedy approach does not necessarily assign balanced layers to the host and the accelerator(s). Workload distribution and scheduling of each DNN model execution graph are necessary to optimize each DNN model for the target platform. Specifically, a Dynamic Programming (DP) based partitioning provides the optimal assignment, and its optimization space is relatively smaller.
The DP based formulation is best for latency improvement in both the parallel and pipeline cases. The system 100 has been evaluated with virtual machines with server grade CPUs and 64 GB RAM, emulating IoT devices. Further, deploying a large DNN model on the edge node by partitioning the model graph based on the capacities of the computing hierarchy is complex. The DP formulation considers the performance benefit, which improves fused-layer partitioning for minimizing the total execution time for a single inference.
When the sub-sequence of layers (l1, . . . , lj) is assigned to a single processor, such an assignment is given as Oj,1 = T1,j1, depicting the cost matrix for solving the dynamic programming with rows, columns, and cells representing layers, processing elements, and optimal assignments respectively. This assignment is equivalent to filling up the first column of the layer table with the cumulative layer latency.
In the recursive step, the assignment of layers (l1, . . . , lj) on the processing elements (p1, . . . , pk) requires the optimal assignment of layers (l1, . . . , li−1) on (p1, . . . , pk−1) and the assignment of (li, . . . , lj) on pk, as represented in Equation 5.
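A minimal sketch of the dynamic-programming assignment is given below. It assumes a min-max recurrence in the spirit of Equation 5 (the exact equation appears in the original tables), and the execution and transfer latency arrays are assumed profiling inputs rather than data from the disclosure:

# Sketch: DP layer-table fill and re-trace into a per-layer schedule array.
def layer_assignment(exec_lat, xfer_lat, num_layers, num_procs):
    # exec_lat[k][i]: execution latency of layer i on processor k (1-indexed)
    # xfer_lat[k][i]: latency of transferring layer i's input to processor k
    INF = float("inf")
    T = [[INF] * (num_procs + 1) for _ in range(num_layers + 1)]
    split = [[1] * (num_procs + 1) for _ in range(num_layers + 1)]
    T[0] = [0.0] * (num_procs + 1)

    for j in range(1, num_layers + 1):
        # First column: layers 1..j run back-to-back on the first processor.
        T[j][1] = sum(exec_lat[1][i] for i in range(1, j + 1))
        for k in range(2, num_procs + 1):
            for i in range(1, j + 1):          # candidate: layers i..j on processor k
                tail = xfer_lat[k][i] + sum(exec_lat[k][m] for m in range(i, j + 1))
                cand = max(T[i - 1][k - 1], tail)
                if cand < T[j][k]:
                    T[j][k], split[j][k] = cand, i

    # Re-trace the layer table into a schedule array indexed by layer.
    schedule, j = [0] * (num_layers + 1), num_layers
    for k in range(num_procs, 1, -1):
        if j == 0:
            break
        i = split[j][k]
        for m in range(i, j + 1):
            schedule[m] = k - 1                # 0-indexed processor ids, as in the example
        j = i - 1
    return T[num_layers][num_procs], schedule[1:]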
For the experimental evaluation of the hardware pruning search technique, the pruning resolution, i.e., the upper bound of the pruning ratio search space, was set to 0.99. Initially, the pruning ratio and the step size update are set to 0.2 and 0.1 respectively, and the search continues until the pruning ratio exceeds the pruning resolution or the step size drops below a defined precision limit. It is identified that 10−3 is enough to search the optimal pruning ratios for such a fine-grained search. Table 5 compares the overall sparsity for the ResNet-50 architecture using a random exhaustive search and the hardware pruning search technique.
The randomly generated pruning ratios for each layer of the ResNet-50 architecture provide higher global sparsity than the hardware pruning search technique. This delivers, for each model, the set of pruning ratios per layer such that the global sparsity of each DNN model can be maximized without an accuracy drop.
Table 6 shows a comparison of both the search techniques with the state-of-the-art automated pruning system AMC. AMC is an AutoML method that uses reinforcement learning to learn the optimal distribution of pruning ratios per layer for pruning the DNN model. AMC searches in two modes: accuracy guaranteed, which prunes the network while maintaining its accuracy, and FLOPs constrained compression, which trades reduced FLOPs at the cost of inference accuracy. As reported in Table 6, the optimal pruning search technique finds a better individual element at the cost of a slightly larger search time than AMC.
The hardware pruning search technique achieves the least sparsity improvement among the three algorithms, albeit taking much less time to search. For all the methods, the accuracy of the pruned model remains the same.
One existing method iteratively prunes and quantizes the network and validates its performance on a target hardware using approximations; during the quantization phase, fine-tuning the network recovers some accuracy points. Instead of using approximate computing to validate how a pruned network may perform on a target hardware, the present method deploys and tests during the search using the optimization defined by each processor. The optimal pruning search technique generates two different pruned and accelerated models for the host system (NVIDIA Jetson Nano) and the USB connected DNN accelerator (Coral TPU). The network used here is ResNet-34 trained on the CIFAR-10 dataset. Table 7 shows the effect of using the optimal pruning search technique and generating the two different pruned and accelerated models for the host board (Jetson Nano) and the USB accelerator (Coral TPU).
A sample schedule resembles [0, 0, 1, 1, 1, 1, . . . , 1], denoting that the first two layers execute on the Nano and all subsequent layers are assigned to the TPU. Table 8 presents the detailed results of applying the layer assignment sequence technique on ResNet-34 trained on CIFAR-10 and on two other models trained on ImageNet.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments of the present disclosure herein address the unresolved problem of pruning deep neural networks (DNNs). The embodiments thus provide a method and system for jointly pruning and hardware acceleration of pre-trained deep learning models. Moreover, the embodiments herein further provide a unified approach for pruning the DNN model layers with zero accuracy loss due to hardware acceleration. This is achieved by pruning the deep learning model using the most optimal pruning ratios with no loss in accuracy. The method of the present disclosure has two different pruning techniques with an iterative refinement strategy for pruning, considering the subsequent hardware acceleration on the DNN accelerators. The method generates the transformed and accelerated DNN models for all the computing elements and maps the DNN model layers to the set of computing elements with heterogeneous capacity. The method demonstrates the efficacy of such partitioning and inference workload mapping on an actual embedded host and a DNN accelerator.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g., any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g., hardware means like e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g., an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means, and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g., using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
202221043520 | Jul 2022 | IN | national |