Energy-efficiency of artificial intelligence (AI) workloads, such as running neural networks for visual recognition, is key for mobile and automotive hardware platforms. However, hand-optimizing efficient on-platform neural networks is prohibitively time- and resource-consuming. In this process, a trained engineer will first select a neural network (or networks) based on experience from among trillions of architectural options. Then, the selected neural network(s) will be trained from scratch, requiring around 400-500 GPU-hours each to complete the design process. When the trained neural network does not achieve its efficiency targets, typically a set quality level (e.g., classification accuracy) for a fixed hardware metric (e.g., latency on a platform), or whenever the driver software or hardware platform changes, the process has to be repeated.
Various aspects of the disclosure provide methods executable on a computing device for selecting a neural network. Various aspects include using an accuracy predictor to select from a search space a neural network comprising a first plurality of the blockwise knowledge distillation trained search blocks, in which the accuracy predictor is built using blockwise knowledge distillation trained search blocks that were trained from the search space.
Various aspects may include selecting a second plurality of the blockwise knowledge distillation trained search blocks based on criteria of predicted accuracy using the accuracy predictor and a cost function for implementing the second plurality of the blockwise knowledge distillation trained search blocks.
In some aspects, selecting the second plurality of the blockwise knowledge distillation trained search blocks may include using an evolutionary search to select the second plurality of the blockwise knowledge distillation trained search blocks.
In some aspects, the second plurality of the blockwise knowledge distillation trained search blocks are Pareto-optimal blockwise knowledge distillation trained search blocks.
In some aspects, using the accuracy predictor to select from the search space the neural network may include selecting the first plurality of the blockwise knowledge distillation trained search blocks using a scenario-aware search to select the first plurality of the blockwise knowledge distillation trained search blocks.
Some aspects may include initializing the first plurality of the blockwise knowledge distillation trained search blocks using weights of the blockwise knowledge distillation trained search blocks and fine-tuning the neural network using knowledge distillation.
Some aspects may include selecting a sub-set of neural networks of the search space, in which each neural network of the sub-set of neural networks may include blockwise knowledge distillation trained search blocks of the generated blockwise knowledge distillation trained search blocks; initializing the blockwise knowledge distillation trained search blocks of the sub-set of neural networks using weights of the blockwise knowledge distillation trained search blocks, and fine-tuning the sub-set of neural networks using knowledge distillation.
Some aspects may include extracting a quality metric by using blockwise knowledge distillation to train the neural network blocks from the search space and extracting a target by fine-tuning the sub-set of neural networks using knowledge distillation, in which the accuracy predictor is built using a linear regression model from the quality metric to the target.
In some aspects, using the accuracy predictor to select from the search space the neural network may include selecting the neural network of the search space based on a search of the blockwise knowledge distillation trained search blocks using a criterion of predicted accuracy using the accuracy predictor and a cost function for implementing blockwise knowledge distillation trained search blocks of the neural network, and such aspects may further include initializing the second plurality of the blockwise knowledge distillation trained search blocks using weights of the blockwise knowledge distillation trained search blocks, and fine-tuning the neural network using knowledge distillation, to generate a distilled neural network.
Some aspects may include using blockwise knowledge distillation to train neural network blocks from an extended search space to generate blockwise knowledge distillation trained search blocks and quality metrics, using the accuracy predictor to predict accuracy of the extended search space, in which the accuracy predictor is built for the search space different from the extended search space.
Further aspects include a computing device including a processor configured with processor-executable instructions to perform operations of any of the methods summarized above. Further aspects include a non-transitory processor-readable storage medium having stored thereon processor-executable software instructions configured to cause a processor to perform operations of any of the methods summarized above. Further aspects include a computing device having means for accomplishing functions of any of the methods summarized above.
The accompanying drawings, which are incorporated herein and constitute part of this specification, illustrate example embodiments of various embodiments, and together with the general description given above and the detailed description given below, serve to explain the features of the claims.
Various embodiments will be described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made to particular examples and implementations are for illustrative purposes, and are not intended to limit the scope of the claims.
Various embodiments provide methods for selecting a neural network architecture suitable for a hardware configuration. Various embodiments may include using an accuracy predictor to select from a search space a neural network comprising a first plurality of the blockwise knowledge distillation trained search blocks. Various embodiments may include building the accuracy predictor using blockwise knowledge distillation trained search blocks that were trained from the search space. Various embodiments may include implementing a search for identifying knowledge distillation trained neural network blocks from the search space based on predicted accuracy and any number and combination of cost functions for the search blocks. Various embodiments may include fine-tuning a neural network made of knowledge distillation trained neural network blocks from the search space, selected based on the search, to generate a distilled neural network. In various embodiments, fine-tuning the neural network may use weights initialized from the knowledge distillation training of the neural network blocks from the search space. In various embodiments, fine-tuning the neural network may use knowledge distillation.
The term “computing device” is used herein to refer to any one or all of servers, personal computers, mobile devices, cellular telephones, smartphones, portable computing devices, personal or mobile multi-media players, personal data assistants (PDAs), laptop computers, tablet computers, smartbooks, IoT devices, palm-top computers, wireless electronic mail receivers, multimedia Internet enabled cellular telephones, connected vehicles, wireless gaming controllers, and similar electronic devices that include a memory and a programmable processor.
The term “neural network” is used herein to refer to an interconnected group of processing nodes (e.g., neuron models, etc.) that collectively operate as a software application or process that controls a function of a computing device or generates a neural network inference. Individual nodes in a neural network may attempt to emulate biological neurons by receiving input data, performing simple operations on the input data to generate output data, and passing the output data (also called “activation”) to the next node in the network. Each node may be associated with a weight value that defines or governs the relationship between input data and activation. The weight values may be determined during a training phase and iteratively updated as data flows through the neural network.
Deep neural networks implement a layered architecture in which the activation of a first layer of nodes becomes an input to a second layer of nodes, the activation of a second layer of nodes becomes an input to a third layer of nodes, and so on. As such, computations in a deep neural network may be distributed over a population of processing nodes that make up a computational chain. Deep neural networks may also include activation functions and sub-functions (e.g., a rectified linear unit that cuts off activations below zero, etc.) between the layers. The first layer of nodes of a deep neural network may be referred to as an input layer. The final layer of nodes may be referred to as an output layer. The layers in-between the input and final layer may be referred to as intermediate layers, hidden layers, or black-box layers.
Each layer in a neural network may have multiple inputs, and thus multiple previous or preceding layers. Said another way, multiple layers may feed into a single layer. For ease of reference, some of the embodiments are described with reference to a single input or single preceding layer. However, it should be understood that the operations disclosed and described in this application may be applied to each of multiple inputs to a layer as well as multiple preceding layers.
Hand-optimizing efficient on-platform neural networks is prohibitively time- and resource-consuming, requiring a trained engineer to select a neural network (or networks) based on experience from among trillions of architectural options. The selected neural network(s) must then be trained from scratch, requiring around 400-500 GPU-hours each to complete the design process. When the trained neural network does not achieve its efficiency targets, typically a set quality level (e.g., classification accuracy) for a fixed hardware metric (e.g., latency on a platform), or whenever the driver software or hardware platform changes, the process has to be repeated.
Neural architecture search (NAS) methods are used to help alleviate the costs associated with hand-optimizing efficient on-platform neural networks. However, existing NAS methods (a) are still too expensive in terms of resources, (b) have limitations on the types of architectures they can search for, or (c) cannot be used to optimally design for modern hardware platforms directly.
The embodiments described herein may provide improvements on the high cost of hand-optimizing efficient on-platform neural networks, and improvements on the limitations of existing NAS methods, by automating the design process for a wide and diverse set of potential neural network architectures and hardware platforms and by making the design process less resource intensive. Some embodiments may reduce the resource requirements of designing energy efficient on-platform neural networks both in terms of man-hours and compute costs. These reductions in resource requirements may be achieved by (1) a novel way to build accuracy models for a diverse search space of candidate neural network architectures with a variety of cell-types. Blockwise knowledge distillation may be implemented to build accuracy models from a diverse neural network architectural search space. Using blockwise knowledge distillation may allow for cheap modeling of the accuracy of neural networks with varying micro-architectures (network-depths, kernel-sizes and expansion-rates), and also across varying macro-architectures (cell-types, attention-mechanisms, activation functions and channel-widths), which is not a feature of existing NAS methods. In addition, the accuracy models may allow for better prediction of the ranking of neural network architectures than existing NAS methods.
Some embodiments may further include (2) a quick evolutionary search-phase extracting a front of architectures in terms of accuracy and some on-target efficiency metric (e.g., number of operations, latency, energy consumption, etc.), such as a Pareto-optimal front. The evolutionary search may be performed using the prior accuracy model together with hardware measurements in the loop and may be repeated quickly many times, amortizing the resource-costs of building the accuracy model. A brief search phase may find latency-accuracy Pareto-optimal neural network architectures for any use-case or hardware platform by running a 2D-optimization algorithm using the accuracy model together with hardware measurements in the loop, which may be quickly rerun whenever anything changes to a use-case, hardware platform, or platform software version.
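For illustration only, the following is a minimal sketch of extracting a latency-accuracy Pareto-optimal front from candidates that have already been scored by an accuracy predictor and a hardware cost measurement; the candidate names, accuracy values, and latencies are hypothetical placeholders, not results from the disclosure.

```python
# Hypothetical sketch: extract the latency-accuracy Pareto front from
# candidates represented as (architecture, predicted_accuracy, measured_latency_ms).
def pareto_front(candidates):
    best, best_acc = [], float("-inf")
    # Walk candidates in order of increasing latency; keep each one that
    # improves on the best accuracy seen so far (i.e., is not dominated).
    for arch, acc, lat in sorted(candidates, key=lambda c: c[2]):
        if acc > best_acc:
            best.append((arch, acc, lat))
            best_acc = acc
    return best

candidates = [("net_a", 0.74, 12.0), ("net_b", 0.76, 15.0), ("net_c", 0.73, 14.0)]
print(pareto_front(candidates))  # net_c is dominated by net_a and is dropped
```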
Some embodiments may further provide that (3) the way the accuracy model is built in (1) allows any neural network in the search space to be fine-tuned quickly, up to full accuracy. This may be 9 to 30 times faster than training the neural network from scratch.
The embodiments described herein may be scalable for use across multiple use-cases, multiple hardware configurations, or multiple platform software configurations. The upfront cost of building an accuracy model by performing blockwise knowledge distillation, which requires training partial neural networks (blocks), may be amortized by the ability to reuse the accuracy model multiple times under various circumstances. Using a built accuracy model, the cost of a search scales linearly with the number of different use-cases, hardware configurations, or platform software configurations to which the accuracy model is applied to find efficient neural networks. The embodiments described herein may be highly parallelized, such that the embodiments can adapt to a resource budget of any number of processors (e.g., GPUs), as opposed to existing NAS methods that can only scale up their compute to certain practical maximum numbers of GPUs before training can become unstable.
In some embodiments, the neural network designed using an accuracy model, evolutionary search, and fine-tuning may also be applicable to downstream tasks. For example, the designed neural network may be an image classification neural network and may be applicable for use in computer vision tasks, such as object detection, semantic segmentation models, super-resolution models, video classification, video segmentation, etc. In some embodiments, the designed neural network may be applicable to downstream tasks through reusing designed classification networks as a reference. In some embodiments, the designed neural network may be applicable to downstream tasks through performing a search directly on a downstream neural network.
In some embodiments, the neural network designed using an accuracy model, evolutionary search, and fine-tuning may also be applicable to designing, compressing, improving, and/or selecting hardware for other neural networks.
In feed-forward neural networks, such as the neural network 100 illustrated in
The neural network 100 illustrated in
An example computation performed by the processing nodes and/or neural network 100 may be:

yj = f(Σi Wij·xi + b)

in which Wij are weights, xi is the input to the layer, yj is the output activation of the layer, f(•) is a non-linear function, and b is bias, which may vary with each node (e.g., bj). As another example, the neural network 100 may be configured to receive pixels of an image (i.e., input values) in the first layer, and generate outputs indicating the presence of different low-level features (e.g., lines, edges, etc.) in the image. At a subsequent layer, these features may be combined to indicate the likely presence of higher-level features. For example, in training of a neural network for image recognition, lines may be combined into shapes, shapes may be combined into sets of shapes, etc., and at the output layer, the neural network 100 may generate a probability value that indicates whether a particular object is present in the image.
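For illustration only, a minimal NumPy sketch of the per-layer computation above; the layer sizes and the choice of ReLU for f(•) are assumptions.

```python
import numpy as np

def dense_layer(x, W, b):
    # Weighted sum of inputs plus bias, passed through a non-linear function f
    # (ReLU is assumed here): y_j = f(sum_i W_ij * x_i + b_j).
    return np.maximum(W @ x + b, 0.0)

x = np.random.rand(8)      # input activations x_i
W = np.random.rand(4, 8)   # weight matrix W_ij (4 outputs, 8 inputs)
b = np.zeros(4)            # per-node bias b_j
y = dense_layer(x, W, b)   # output activations y_j
```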
The neural network 100 may learn to perform new tasks over time. However, the overall structure of the neural network 100, and operations of the processing nodes, do not change as the neural network learns the task. Rather, learning is accomplished during a training process in which the values of the weights and bias of each layer are determined. After the training process is complete, the neural network 100 may begin “inference” to process a new task with the determined weights and bias.
Training the neural network 100 may include causing the neural network 100 to process a task for which an expected/desired output is known, and comparing the output generated by the neural network 100 to the expected/desired output. The difference between the expected/desired output and the output generated by the neural network 100 is referred to as loss (L).
During training, the weights (wij) may be updated using a hill-climbing optimization process called “gradient descent.” This gradient indicates how the weights should change in order to reduce loss (L). A multiple of the gradient of the loss relative to each weight, which may be the partial derivative of the loss with respect to the weight (∂L/∂wij), could be used to update the weights.
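For illustration only, a minimal sketch of such a weight update; the learning rate and the placeholder gradient values are assumptions, and in practice the gradient would come from backpropagation as described below.

```python
import numpy as np

def sgd_step(weights, grad_loss_wrt_weights, learning_rate=0.01):
    # Move each weight by a multiple (the learning rate) of the partial
    # derivative of the loss with respect to that weight.
    return weights - learning_rate * grad_loss_wrt_weights

W = np.random.rand(4, 8)
dL_dW = np.random.rand(4, 8)   # placeholder for gradients from backpropagation
W = sgd_step(W, dL_dW)
```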
An efficient way to compute the partial derivatives of the gradient is through a process called backpropagation, an example of which is illustrated in
The input layer 201 may receive and process an input signal 206, generate an activation 208, and pass it to the intermediate layer(s) 202 as black-box inputs. The intermediate layer(s) inputs may multiply the incoming activation with a weight matrix 210 or may apply one or more weight factors and/or a bias to the black-box inputs.
The nodes in the intermediate layer(s) 202 may execute various functions on the inputs augmented with the weight factors and the bias. Intermediate signals may be passed to other nodes or layers within the intermediate layer(s) 202 to produce the intermediate layer(s) activations that are ultimately passed as inputs to the output layer 204. The output layer 204 may include a weighting matrix that further augments each of the received signals with one or more weight factors and bias. The output layer 204 may include a node 242 that operates on the inputs augmented with the weight factors to produce an estimated value 244 as output or neural network inference.
The neural networks 100, 200 described above include fully-connected layers in which all outputs are connected to all inputs, and each processing node's activation is a weighted sum of all the inputs received from the previous layer. In larger neural networks, this may require that the network perform complex computations. The complexity of these computations may be reduced by reducing the number of weights that contribute to the output activation, which may be accomplished by setting the values of select weights to zero. The complexity of these computations may also be reduced by using the same set of weights in the calculation of every output of every processing node in a layer.
Some neural networks may be configured to generate output activations based on convolution. By using convolution, the neural network layer may compute a weighted sum for each output activation using only a small “neighborhood” of inputs (e.g., by setting all other weights beyond the neighborhood to zero, etc.), and share the same set of weights (or filter) for every output. A set of weights is called a filter or kernel. A filter (or kernel) may also be a two- or three-dimensional matrix of weight parameters. In various embodiments, a computing device may implement a filter via a multidimensional array, map, table or any other information structure known in the art.
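For illustration only, a minimal NumPy sketch of a single-channel convolution computing each output activation from a small neighborhood of inputs with a shared kernel of weights; the input and kernel sizes are assumptions.

```python
import numpy as np

def conv2d_single_channel(inputs, kernel):
    kh, kw = kernel.shape
    oh, ow = inputs.shape[0] - kh + 1, inputs.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # Weighted sum over a small neighborhood of inputs, with the same
            # shared kernel of weights reused at every output position.
            out[r, c] = np.sum(inputs[r:r + kh, c:c + kw] * kernel)
    return out

feature_map = conv2d_single_channel(np.random.rand(28, 28), np.random.rand(3, 3))
```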
Generally, a convolutional neural network is a neural network that includes multiple convolution-based layers. The use of convolution in multiple layers allows the neural network to employ a very deep hierarchy of layers. As a result, convolutional neural networks often achieve significantly better performance than neural networks that do not employ convolution.
With reference to
The convolution functionality component 302, 312 may be an activation function for its respective layer 301, 311. The convolution functionality component 302, 312 may be configured to generate a matrix of output activations called a feature map. The feature maps generated in each successive layer 301, 311 typically include values that represent successively higher-level abstractions of input data (e.g., line, shape, object, etc.).
The non-linearity functionality component 304, 314 may be configured to introduce nonlinearity into the output activation of its layer 301, 311. In various embodiments, this may be accomplished via a sigmoid function, a hyperbolic tangent function, a rectified linear unit (ReLU), a leaky ReLU, a parametric ReLU, an exponential LU function, a maxout function, swish, etc.
The normalization functionality component 306, 316 may be configured to control the input distribution across layers to speed up training and improve the accuracy of the outputs or activations. For example, the distribution of the inputs may be normalized to have a zero mean and a unit standard deviation. The normalization function may also use batch normalization (BN) techniques to further scale and shift the values for improved performance.
The pooling functionality components 308, 318 may be configured to reduce the dimensionality of a feature map generated by the convolution functionality component 302, 312 and/or otherwise allow the convolutional neural network 300 to resist small shifts and distortions in values.
With reference to
In block 602, the processor may define a reference neural network for a search-space. The reference neural network, as described herein with reference to
In block 604, the processor may define varying parameters for the search space. The varying parameters may be used to define the neural network architectures included in the search-space. The varying parameters may include varying cell-type (e.g., style of convolutions), attention mechanisms, kernel sizes, number of layers per block, activation functions, expansion rates, network width, network depth, etc. Any number and combination of the varying parameters and the constraints of the reference neural network may define a block or neural network architecture in the search space. In some embodiments, the processor may receive varying parameters from a user input or from a memory accessible to the processor.
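For illustration only, a minimal sketch of defining varying parameters for a search space as a configuration; the specific option values are hypothetical and not taken from the disclosure.

```python
import itertools

# Hypothetical varying parameters; each candidate block is one combination.
search_space = {
    "cell_type":        ["depthwise_separable", "standard_conv", "grouped_conv"],
    "attention":        [None, "squeeze_excite"],
    "activation":       ["relu", "swish"],
    "kernel_size":      [3, 5, 7],
    "layers_per_block": [1, 2, 3, 4],
    "expansion_rate":   [2, 4, 6],
    "channel_width":    [16, 24, 32, 48],
}

block_choices = [dict(zip(search_space, combo))
                 for combo in itertools.product(*search_space.values())]
print(len(block_choices), "candidate block configurations per block position")
```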
With reference to
To implement the blockwise knowledge distillation, a reference neural network 700 may be selected, trained, and split into N blocks BT_n. Each block BT_n may transform an input feature map DT_(n-1) into DT_n using a transfer function FT_n(WT_n), where WT_n are the parameters of the block.
Using the trained reference neural network 700, the parameters of all possible blocks B_xn in the search space may be trained. This may allow for (A) extracting quality metrics useful in building supervised regression models for accuracy and (B) initializing their weights for building unsupervised accuracy models, or for quick fine-tuning of models in the search space.
In some embodiments, training the parameters of all possible blocks B_xn in the search space may be implemented using blockwise knowledge distillation. The parameters W_xn of all the blocks B_xn, which define the transfer function F_xn(W_xn), may be trained using a blockwise knowledge distillation scheme to approximate the reference function F_n as closely as possible. This is done in a block-wise way by using stochastic gradient descent, requiring gradient back-propagation only from DT_n to DT_(n−1) through B_xn. This process may reduce MSE, per-channel Noise-To-Signal ratio (NSR), or any other relevant loss function of the output features D_xn relative to the trained reference neural network 700 output DT_n, using the trained reference neural network's DT_(n−1) input feature maps. In some embodiments, this blockwise knowledge distillation may converge the loss function quickly, such as after 1 full epoch of training.
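For illustration only, a minimal PyTorch-style sketch of blockwise knowledge distillation for one candidate block; the block module, the (DT_(n-1), DT_n) pairs, the single epoch, and the use of an MSE loss are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def distill_block(student_block, teacher_feature_pairs, lr=1e-3, epochs=1):
    # Train block B_xn so that, given the teacher's input feature map DT_(n-1),
    # its output D_xn approximates the teacher block's output DT_n.
    optimizer = torch.optim.SGD(student_block.parameters(), lr=lr)
    for _ in range(epochs):
        for dt_prev, dt_n in teacher_feature_pairs:   # (DT_(n-1), DT_n) pairs
            optimizer.zero_grad()
            d_xn = student_block(dt_prev)
            loss = F.mse_loss(d_xn, dt_n)             # MSE against the teacher output
            loss.backward()                           # gradients flow only through B_xn
            optimizer.step()
    return student_block
```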
In some embodiments, training the parameters of all possible blocks B_xn in the search space may be implemented using conventional knowledge distillation. In some embodiments, the weights W_xn of B_xn may be updated using conventional knowledge distillation by minimizing a loss function defined by the task's ground truth and the output classifier of the trained reference neural network 700. In such embodiments, a new neural network may be constructed in which as few as 1 out of the N blocks of the trained reference neural network 700 is replaced by block B_xn. The new neural network is then used to train the weights W_xn of B_xn through knowledge distillation.
A neural network 700 trained by knowledge distillation, whether blockwise or conventional, is described as M_xn. In some embodiments, other methods may be used to train the blocks of the search space.
The knowledge distillation, blockwise or conventional, may be used as a means to measure the quality of a block B_xn in the search space and to initialize its weights for later fine-tuning. The knowledge distillation process may result in a block library of quality metrics, an example of which is illustrated in
A dataset of fully trained neural networks may be built for the search space. The built accuracy predictor may be trained using parameters including features (e.g., quality metrics) and targets (e.g., accuracy), an example of which is illustrated in
These fine-tuned neural networks 704 may then be used to build an accuracy predictor 706, an example of which is illustrated in
In some embodiments, the NSR of the feature maps may be computed between a trained reference neural network 700 and a fine-tuned neural network 704. This may be understood as a measure of the distance between F_n and F_xn. The closer F_xn is to F_n, the higher the quality of block B_xn and the lower the average NSR or MSE. Herein, these features may be referred to as B_xna, as there are many different types of metrics that can be used here. In some embodiments, Signal-To-Noise ratio (SNR) may be used in the same way. In some embodiments, NSR may be used in the same way. In some embodiments, MSE may be used in the same way.
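For illustration only, a minimal sketch of a per-channel NSR quality metric between a teacher feature map DT_n and a candidate block's output D_xn; the exact normalization is an assumption.

```python
import torch

def per_channel_nsr(dt_n, d_xn, eps=1e-8):
    # Tensors shaped (batch, channels, height, width); lower NSR means the
    # candidate block's output is closer to the teacher's, i.e., higher quality.
    noise = ((d_xn - dt_n) ** 2).mean(dim=(0, 2, 3))   # per-channel error power
    signal = (dt_n ** 2).mean(dim=(0, 2, 3))           # per-channel signal power
    return (noise / (signal + eps)).mean().item()
```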
In some embodiments, training/validation loss of a fine-tuned neural network 704 M_xn may be used as an accuracy target to fit the accuracy predictor 706. In some embodiments, the accuracy of the fine-tuned neural network 704 may be determined on a validation set. In some embodiments, the accuracy of the fine-tuned neural network 704 may be determined on a part of a fine-tuning set.
In some embodiments, direct validation may be used, by validating samples M_n of the search space with all blocks B_xn fine-tuned. After the batch-norm statistics of the network M_n are reset, the resulting network may be validated. This validation accuracy may be used to model the accuracy the network M_n would achieve if trained from scratch.
In some embodiments, other available measures may be used as targets to build an accuracy predictor 706.
In some embodiments, using a dataset of fully trained neural networks (e.g., fine-tuned neural networks 704, fine-tuned up to full accuracy) in the search space, a supervised accuracy model may be built from features (e.g., quality) and targets (e.g., accuracy) via various different means. In some embodiments, the accuracy model may be built using a linear regression model. In some embodiments, the accuracy model may be built using a gradient boosting regression model. In some embodiments, the accuracy model may be built using a multilayer perceptron model. In some embodiments, the accuracy model may be built using a graph convolutional neural network combining accuracy features and graph features.
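For illustration only, a minimal scikit-learn sketch of fitting a linear-regression accuracy model from per-block quality metrics to measured accuracies; the feature and target values are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per fine-tuned network (features = per-block quality metrics, e.g., NSR),
# one target per network (its measured validation accuracy). Values are placeholders.
quality_features = np.array([[0.12, 0.30, 0.08],
                             [0.25, 0.18, 0.11],
                             [0.09, 0.22, 0.15]])
measured_accuracy = np.array([0.74, 0.71, 0.76])

accuracy_predictor = LinearRegression().fit(quality_features, measured_accuracy)
predicted = accuracy_predictor.predict([[0.10, 0.20, 0.10]])
```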
The means to build an accuracy model described herein pose no limitations on the variety of the search space, as opposed to existing NAS methods, which either rely on weight sharing or have a limited search space due to GPU-memory limitations. These limitations of existing NAS methods severely limit their variety, as not all cell-types, activation functions, and attention styles can share weights. In existing NAS methods, because of their reliance on weight sharing, all blocks B_xn would have to belong to the same cell-type (depthwise separable convolutions, for example) with the same activation functions and attention mechanisms. The means described herein may not be similarly limited, and may allow for mixing different cell-types, activation functions, quantization levels, and/or attention mechanisms, while still being able to model their accuracy reliably. In some embodiments, the means to build an accuracy model described herein may allow for building an accuracy model for a single search space with different attention mechanisms, activation functions, cell-types, channel widths, and quantization settings. In some embodiments, the means to build an accuracy model described herein may allow for adding quantized blocks into the search space, which may build an accuracy model for quantized networks directly. In some embodiments, the accuracy predictor 706 may be understood as a coarse sensitivity model that may indicate which blocks require complex implementations in order to build neural networks with high accuracy.
In block 802, the processor may train any number and combination of search blocks from the search space using knowledge distillation with a trained reference neural network, as described herein with reference to
In block 804, the processor may extract quality features and initialize weights for each search block through knowledge distillation, as described herein with reference to
In block 806, the processor may store the quality features and initialized weights for each trained search block and/or neural network architecture. The processor may store the quality features and initialized weights in a manner that associates them with the respective trained search blocks and/or neural network architectures. The processor may store the quality features and initialized weights using any manner of storage, such as a database, a list, a table, an array, etc. The processor may store the quality features and initialized weights to any memory (e.g., memory 1702, disk drive 1703 in
In block 808, the processor may select a sub-set of neural networks from the search space as targets for building an accuracy model. In some embodiments, the processor may select the sub-set of neural networks from the search space based on a programmed algorithm, heuristic, technique, criteria, etc. In some embodiments, the processor may select the sub-set of neural networks from the search space based on a user input. In some embodiments, each of neural networks of the sub-set may include any combination of trained search blocks. In some embodiments, the neural networks of the sub-set may include any combination of trained neural networks of the search space.
In block 810, the processor may fine-tune the sub-set of neural networks as targets for building an accuracy model using knowledge distillation, as described herein with reference to
In block 812, the processor may extract targets for building an accuracy model using the fine-tuned neural networks, as described herein with reference to
In block 814, the processor may generate an accuracy predictor for neural networks from the search space using the quality features and targets, as described herein with reference to
For example, as illustrated in
In block 1002, the processor may set search parameters. In some embodiments, the search parameters may include a predicted accuracy and a cost function for executing search blocks, as described herein with reference to
In block 1004, the processor may determine search parameter values for search blocks. In some embodiments, the processor may apply the accuracy predictor to the search blocks to determine the predicted accuracy values for the search blocks. In some embodiments, the processor may calculate cost function values for implementing the search blocks. In some embodiments, the processor may measure cost function values for implementing the search blocks. In some embodiments, the processor may retrieve cost function values for implementing the search blocks from the memory accessible to the processor. In some embodiments, the processor may receive cost function values for implementing the search blocks via user input. In some embodiments, the cost function may be scenario-agnostic. In some embodiments, the cost function may be scenario-aware. The scenarios may include hardware configurations, software versions, input data parameters, a number of operations or a number of parameters in a neural network, on-device latency, throughput, energy, etc.
In block 1006, the processor may determine search blocks suited to the criteria. The processor may execute a search algorithm. In some embodiments, the search algorithm may be an evolutionary search algorithm, described herein with reference to
In block 1008, the processor may select the search blocks suited to the criteria. In some embodiments, the processor may select any number and combination of search blocks suited to the criteria, such as N search blocks suited to the criteria. In some embodiments, the processor may select the search blocks best suited to the criteria. In some embodiments, the processor may select search blocks suited to the criteria within selection parameters. In some embodiments, the selection parameters may include a function of the cost function for any number and combination of the search blocks suited to the criteria. In some embodiments, the selection parameters may include a function of the predicted accuracy for any number and combination of the search blocks suited to the criteria.
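For illustration only, a minimal sketch of an evolutionary search over block choices scored by a predicted accuracy and a cost function; the population size, mutation scheme, number of block positions, and latency budget are assumptions, not the disclosure's method.

```python
import random

def evolutionary_search(block_choices, predict_accuracy, cost,
                        generations=20, population=32, num_blocks=5,
                        latency_budget_ms=20.0):
    # Start from a random population of candidate networks (tuples of block choices).
    pop = [tuple(random.choice(block_choices) for _ in range(num_blocks))
           for _ in range(population)]
    for _ in range(generations):
        # Keep candidates within the cost budget, ranked by predicted accuracy.
        feasible = [c for c in pop if cost(c) <= latency_budget_ms]
        parents = sorted(feasible or pop, key=predict_accuracy,
                         reverse=True)[:population // 4]
        # Mutate one block position per child to form the next generation.
        children = []
        for p in parents:
            for _ in range(4):
                child = list(p)
                child[random.randrange(num_blocks)] = random.choice(block_choices)
                children.append(tuple(child))
        pop = parents + children
    feasible = [c for c in pop if cost(c) <= latency_budget_ms]
    return max(feasible or pop, key=predict_accuracy)
```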
In block 1202, the processor may select a neural network from the search space, as described herein with reference to
In block 1204, the processor may initialize the neural network from the search space using weights from the training of the search blocks using knowledge distillation, as described herein with reference to
In block 1206, the processor may fine-tune the neural network having search blocks initialized with the weights from the knowledge distillation by using knowledge distillation with a trained reference neural network, as described herein with reference to
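For illustration only, a minimal PyTorch-style sketch of fine-tuning a selected network, whose blocks were initialized with their distillation weights, against a trained reference (teacher) network; the temperature, optimizer settings, and data loader are assumptions.

```python
import torch
import torch.nn.functional as F

def fine_tune_with_kd(student, teacher, data_loader, lr=1e-3, temperature=4.0, epochs=5):
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for images, _labels in data_loader:
            optimizer.zero_grad()
            with torch.no_grad():
                teacher_logits = teacher(images)       # soft targets from the teacher
            student_logits = student(images)
            # Standard knowledge-distillation loss on temperature-softened outputs.
            loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                            F.softmax(teacher_logits / temperature, dim=1),
                            reduction="batchmean") * temperature ** 2
            loss.backward()
            optimizer.step()
    return student
```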
A downstream task may be a task that is not directly performed by a distilled neural network, or a fine-tuned neural network, from the search space, as described herein with reference to
In
The sampling algorithm may also take a cost function metric 1308, 1358 (e.g., measured latency, predicted latency) as input. To determine a cost function value, the distilled neural network may be embedded into a larger neural network, which may include parts 1310, 1360 (e.g., device, target hardware (HW)) for implementing the downstream task. The cost function metric may be measured for the larger neural network, including the distilled neural network, implemented on hardware or a hardware simulator. The sampling algorithm may use the cost function metric to update the search for search blocks to be used for a neural network configured to contribute to the downstream task.
In some embodiments, the search for suitable search blocks may be executed based on a predicted accuracy of a task generated by an accuracy predictor (e.g., accuracy predictor 706 in
As another example, as illustrated in
In
The sampling algorithm may also take a cost function metric 1408 (e.g., predicted latency) as input. To determine a cost function value, the distilled neural network may be embedded into a larger neural network, which may include parts 1410 (e.g., Target HW) for implementing the downstream task. The cost function metric may be measured for the larger neural network, including the distilled neural network, implemented on hardware or a hardware simulator. The sampling algorithm may use the cost function metric to update the search for search blocks to be used for a neural network configured to contribute to the downstream task.
Various embodiments may be implemented on any of a variety of commercially available computing systems and computing devices, such as a server 1700, an example of which is illustrated in
The processor 1701 may be any programmable microprocessor, microcomputer or multiple processor chip or chips that may be configured by software instructions (applications) to perform a variety of functions, including the functions of the various embodiments described in this application. In some wireless devices, multiple processors may be provided, such as one processor dedicated to wireless communication functions and one processor dedicated to running other applications. Typically, software applications may be stored on non-transitory processor-readable medium, such as a disk drive 1703, before the instructions are accessed and loaded into the processor. The processor 1701 may include internal memory sufficient to store the application software instructions.
Various embodiments illustrated and described are provided merely as examples to illustrate various features of the claims. However, features shown and described with respect to any given embodiment are not necessarily limited to the associated embodiment and may be used or combined with other embodiments that are shown and described. Further, the claims are not intended to be limited by any one example embodiment. For example, one or more of the operations of the methods may be substituted for or combined with one or more operations of the methods.
Implementation examples are described in the following paragraphs. While some of the following implementation examples are described in terms of example methods, further example implementations may include: the example methods discussed in the following paragraphs implemented by a computing device comprising a processor configured with processor-executable instructions to perform operations of the methods of the following implementation examples; the example methods discussed in the following paragraphs implemented by a computing device comprising means for performing functions of the methods of the following implementation examples; and the example methods discussed in the following paragraphs may be implemented as a non-transitory processor-readable storage medium having stored thereon processor-executable instructions configured to cause a processor in a computing device to perform the operations of the methods of the following implementation examples.
Example 1. A method implemented on a computing device for selecting a neural network for a hardware configuration, including using an accuracy predictor to select from a search space a neural network including a first plurality of the blockwise knowledge distillation trained search blocks, in which the accuracy predictor is built using blockwise knowledge distillation trained search blocks that were trained from the search space.
Example 2. The method of example 1, further including selecting a second plurality of the blockwise knowledge distillation trained search blocks based on criteria of predicted accuracy using the accuracy predictor and a cost function for implementing the second plurality of the blockwise knowledge distillation trained search blocks.
Example 3. The method of example 2, in which selecting the second plurality of the blockwise knowledge distillation trained search blocks may include using an evolutionary search to select the second plurality of the blockwise knowledge distillation trained search blocks.
Example 4. The method of example 3, in which the second plurality of the blockwise knowledge distillation trained search blocks are Pareto-optimal blockwise knowledge distillation trained search blocks.
Example 5. The method of any of examples 1-4, in which using the accuracy predictor to select from the search space the neural network includes selecting the first plurality of the blockwise knowledge distillation trained search blocks using a scenario-aware search to select the first plurality of the blockwise knowledge distillation trained search blocks.
Example 6. The method of any of examples 1-5, including: initializing the first plurality of the blockwise knowledge distillation trained search blocks using weights of the blockwise knowledge distillation trained search blocks; and fine-tuning the neural network using knowledge distillation.
Example 7. The method of examples 1-6, including: selecting a sub-set of neural networks of the search space, in which each neural network of the sub-set of neural networks may include blockwise knowledge distillation trained search blocks of the generated blockwise knowledge distillation trained search blocks; initializing the blockwise knowledge distillation trained search blocks of the sub-set of neural networks using weights of the blockwise knowledge distillation trained search blocks, and fine-tuning the sub-set of neural networks using knowledge distillation.
Example 8. The method of example 7, including: extracting a quality metric by using blockwise knowledge distillation to train the neural network blocks from the search space and extracting a target by fine-tuning the sub-set of neural networks using knowledge distillation, in which the accuracy predictor is built using a linear regression model from the quality metric to the target.
Example 9. The method of any of examples 1-8, in which using the accuracy predictor to select from the search space the neural network includes selecting the neural network of the search space based on a search of the blockwise knowledge distillation trained search blocks using a criterion of predicted accuracy using the accuracy predictor and a cost function for implementing blockwise knowledge distillation trained search blocks of the neural network, the method further including: initializing the second plurality of the blockwise knowledge distillation trained search blocks using weights of the blockwise knowledge distillation trained search blocks; and fine-tuning the neural network using knowledge distillation, to generate a distilled neural network.
Example 10. The method of any of examples 1-9, including: using blockwise knowledge distillation to train neural network blocks from an extended search space to generate blockwise knowledge distillation trained search blocks and quality metrics, using the accuracy predictor to predict accuracy of the extended search space, in which the accuracy predictor is built for the search space different from the extended search space.
Example 11. The method of any of examples 1-10, further including using blockwise knowledge distillation to train neural network blocks from a search space to generate the blockwise knowledge distillation trained search blocks.
The foregoing method descriptions and the process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the operations of various embodiments may be performed in the order presented. As will be appreciated by one of skill in the art the order of operations in the foregoing embodiments may be performed in any order. Words such as “thereafter,” “then,” “next,” etc. are not intended to limit the order of the operations; these words are used to guide the reader through the description of the methods. Further, any reference to claim elements in the singular, for example, using the articles “a,” “an,” or “the” is not to be construed as limiting the element to the singular.
Various illustrative logical blocks, modules, functionality components, circuits, and algorithm operations described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and operations have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such embodiment decisions should not be interpreted as causing a departure from the scope of the claims.
The hardware used to implement the various illustrative logics, logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but, in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Alternatively, some operations or methods may be performed by circuitry that is specific to a given function.
In one or more embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable medium or a non-transitory processor-readable medium. The operations of a method or algorithm disclosed herein may be embodied in a processor-executable software module that may reside on a non-transitory computer-readable or processor-readable storage medium. Non-transitory computer-readable or processor-readable storage media may be any storage media that may be accessed by a computer or a processor. By way of example but not limitation, such non-transitory computer-readable or processor-readable media may include RAM, ROM, EEPROM, FLASH memory, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of non-transitory computer-readable and processor-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable medium and/or computer-readable medium, which may be incorporated into a computer program product.
The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the claims. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and implementations without departing from the scope of the claims. Thus, the present disclosure is not intended to be limited to the embodiments and implementations described herein, but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.
This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/114,463 entitled “A Method For Automatically Designing Efficient Hardware-Aware Neural Networks For Visual Recognition Using Knowledge Distillation” filed Nov. 16, 2020, the entire contents of which are incorporated herein by reference for all purposes.