This disclosure relates to the field of artificial intelligence. More particularly, embodiments disclosed herein provide for optimization of a neural network (NN).
A neural network is a type of machine-learning model that mimics the manner in which a human brain operates, and comprises a collection of interconnected software and/or hardware processing nodes that are similar to neurons in the brain. Nodes receive input data and use their logic to determine what value or other information, if any, to pass forward to other nodes. A NN may be trained with labeled training data that allows the processing nodes to learn how to recognize particular objects, classify or cluster data, identify patterns, identify an entity that differs from a pattern, etc.
Nodes are organized into layers, with those in each layer receiving their input data from a preceding layer and feeding their output to a subsequent layer. The greater the number of layers and the number of nodes within a layer, the more powerful the NN becomes, but the complexity increases commensurately. For example, the greater the number of nodes a neural network contains, the longer it takes to properly train the network.
Traditionally, optimization of a neural network is a serial process that includes a design stage and a tuning stage that are mutually dependent. For example, in the design stage a machine-learning expert will first explore some number of NN model designs and train them for accuracy. Only after the expert selects a model will a machine-learning engineer begin efforts to tune the model (e.g., for speed and power). Based on the engineer's efforts, the expert may need to revise the model or the training method. Because the efforts of the expert and the engineer are dependent upon each other, the optimization process can take a significant length of time until a solution is obtained that satisfies applicable criteria, during which one or the other of the expert and engineer, and their resources, may be idle.
Thus, there is a need for a system and method for expediting the process of optimizing a neural network.
In some embodiments, systems and methods are provided for enabling independence between workflows involved in the process of optimizing a neural network (NN) or other machine-learning model. More particularly, machine-learning engineers determine which of multiple models or model variants successfully satisfy applicable hardware constraints (e.g., in terms of speed and power). Meanwhile, machine-learning experts train only the successful models and/or variants, and evaluate them for accuracy. Thus, optimization for accuracy and optimization for latency (or speed and power) can proceed in parallel.
Moreover, unlike traditional optimization techniques in which a single NN model is optimized first for accuracy and then for speed and power, in some embodiments disclosed herein multiple different models or model variants may be in flight simultaneously since the optimization process is bifurcated into two independent stages. In addition, instead of working with a static hardware model or architecture, in some embodiments the architecture may evolve in response to evaluation of the various models and model variants.
In these embodiments, multiple variants of a selected machine-learning model (e.g., a neural network) are derived by modifying the model in some unique manner (e.g., to eliminate nodes, to prune channels). Each variant may be optimized in some manner, via quantization for example, to reduce its complexity and/or footprint, and is compiled to produce a runtime artifact. The variants are then tested (e.g., for latency, speed, and/or power) on a selected hardware architecture mimicked by a selected set of embedded hardware devices.
Only those variants that satisfy specified criteria in the latency evaluation are subsequently trained and then tested for accuracy. In parallel with this accuracy evaluation, other variants of the same model, or some variants of a different model, may undergo preparation for and execution of the latency evaluation.
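By way of illustration only, the following Python sketch shows how such a latency gate might precede the training stage; the callables build_runtime, measure_latency_ms, train, and evaluate_accuracy are hypothetical placeholders supplied by the caller and are not part of this disclosure.

```python
from typing import Any, Callable, Iterable, List, Tuple

def latency_gate(variants: Iterable[Any],
                 build_runtime: Callable[[Any], Any],
                 measure_latency_ms: Callable[[Any], float],
                 latency_budget_ms: float) -> List[Any]:
    """Return only the variants whose quantized/compiled runtimes meet the latency budget."""
    survivors = []
    for variant in variants:
        runtime = build_runtime(variant)      # hypothetical: quantize + compile for the target hardware
        if measure_latency_ms(runtime) <= latency_budget_ms:
            survivors.append(variant)
    return survivors

def accuracy_stage(survivors: Iterable[Any],
                   train: Callable[[Any], Any],
                   evaluate_accuracy: Callable[[Any], float],
                   accuracy_target: float) -> List[Tuple[Any, float]]:
    """Train and score only the latency-qualified variants; this stage can run in
    parallel with latency_gate() operating on the next batch of variants."""
    passed = []
    for variant in survivors:
        accuracy = evaluate_accuracy(train(variant))   # hypothetical caller-supplied callables
        if accuracy >= accuracy_target:
            passed.append((variant, accuracy))
    return passed
```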
Results of the latency and accuracy evaluations are saved, perhaps to a knowledge database. The knowledge database can therefore serve as a data repository for users (e.g., machine-learning experts and/or engineers) to use to examine the results, select models or variants for continued testing, identify optimizations that are particularly effective (or ineffective), etc.
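As a non-limiting sketch, and assuming a simple in-memory store rather than any particular database technology, a knowledge-database record might be represented as follows; field names such as variant_id and target_hw are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class EvaluationRecord:
    variant_id: str
    target_hw: str                          # e.g., "cpu", "dsp", "gpu"
    latency_ms: Optional[float] = None      # None until the latency evaluation runs
    accuracy: Optional[float] = None        # None until training/accuracy evaluation runs
    extras: Dict[str, float] = field(default_factory=dict)  # power, memory, bandwidth, ...

class KnowledgeDB:
    """Minimal in-memory stand-in for a knowledge database of evaluation results."""
    def __init__(self) -> None:
        self._records: List[EvaluationRecord] = []

    def add(self, record: EvaluationRecord) -> None:
        self._records.append(record)

    def partial_results(self) -> List[EvaluationRecord]:
        # Records that have latency data but no accuracy yet, i.e., awaiting training.
        return [r for r in self._records if r.latency_ms is not None and r.accuracy is None]
```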
In some embodiments, an analytics module infers or predicts the results of one evaluation of a model variant or set of variants (e.g., accuracy evaluation) based on the results of one or more other evaluations (e.g., the latency evaluation).
The following description is presented to enable any person skilled in the art to make and use the disclosed embodiments, and is provided in the context of one or more particular applications and their requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the scope of those that are disclosed. Thus, the present invention or inventions are not intended to be limited to the embodiments shown, but rather are to be accorded the widest scope consistent with the disclosure.
In some embodiments, tools and processes are provided for efficiently optimizing the operation of a neural network (NN) for execution on a target set of hardware resources. These embodiments attempt to avoid a NN generation workflow in which one stage or operation is stalled while awaiting a result from another stage.
This standard approach thus creates and enforces a co-dependence between the machine-learning (ML) expert and the ML engineer. Training complex models on large datasets can take days (if not weeks) on a dedicated cluster of GPUs, and because the ML engineer may have to wait for each trained model, hardware resources sit idle and unproductive between iterations.
In contrast, embodiments disclosed herein decouple the design and tuning workflows so that neither the ML expert nor the ML engineer must sit idle while waiting on the other, as described below.
One challenge faced while optimizing a neural network involves convergence upon an optimal solution. In the standard serial approach, each change to the model architecture or training method must await the outcome of a full training and tuning cycle before its effect can be assessed, which slows convergence.
In operation 302, neural network architecture selection occurs, which involves selecting target hardware resources, and/or configurations of such resources, for operation of the optimized neural network. Thus, a type of processor (e.g., CPU, DSP, GPU) may be selected, a memory configuration may be chosen, etc.
It should be noted that, when processing returns to operation 302 after failed evaluation of a model for accuracy or latency, for example, feedback associated with the failed evaluation may lead to a change in the neural network architecture and/or target hardware selection criteria. Specifically, the target hardware model may be based on immediate results of a recent evaluation and/or results of past evaluations.
Following operation 302, the process splits into parallel workflows 212, 214: one directed to training and accuracy evaluation, the other to hardware-oriented optimization and latency evaluation.
In operation 310, one or more models or model variants are quantized (e.g., for size and speed of execution). In a first iteration of this operation, if no models are yet available that have been successfully evaluated for accuracy (and/or other performance metrics), a first set of tentative neural network model variants is assembled and quantized, wherein each variant is a derivative model configured from a base NN model architecture. For example, a model variant may be produced by strategically pruning a few channels to reduce the size of the model. In subsequent iterations, quantization may be performed only on model variants that have passed evaluation for accuracy (in the parallel workflow).
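By way of illustration, the Python/NumPy sketch below shows one way a variant might be derived by magnitude-based channel pruning and then reduced via simple symmetric int8 quantization; the keep_ratio parameter and function names are assumptions for illustration, not the specific pruning or quantization schemes of any embodiment.

```python
import numpy as np

def prune_channels(weight: np.ndarray, keep_ratio: float = 0.75) -> np.ndarray:
    """Keep only the output channels of a conv weight tensor with the largest L2 norms.

    Assumes `weight` has shape (out_channels, in_channels, kh, kw); the result is a
    smaller derivative of the base layer.
    """
    out_channels = weight.shape[0]
    keep = max(1, int(round(keep_ratio * out_channels)))
    norms = np.linalg.norm(weight.reshape(out_channels, -1), axis=1)
    kept_idx = np.sort(np.argsort(norms)[-keep:])      # preserve original channel order
    return weight[kept_idx]

def quantize_int8(weight: np.ndarray):
    """Symmetric linear quantization of a float tensor to int8 plus a scale factor."""
    scale = float(np.max(np.abs(weight))) / 127.0 or 1.0   # avoid a zero scale
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale
```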
In operation 312, the quantized model/variant is compiled by generating a set of operations to run on the target hardware. The selected operations are based on the results of operation 310, whereby the models are processed (e.g., quantized) for size and speed of execution.
In operation 314, an executable runtime produced from operation 312 is evaluated for latency. If it satisfies the applicable latency threshold and/or other performance metrics, it begins training via operation 320. In different embodiments, the runtime's performance evaluation may examine different metrics in addition to or instead of latency, such as inference speed, storage size, power, and memory bandwidth. The evaluation of operation 314 may be performed on target hardware.
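For example, and assuming a host-side Python harness is available for the target hardware, latency might be sampled along the following lines; run_inference is a hypothetical caller-supplied callable that invokes the compiled runtime artifact once.

```python
import statistics
import time
from typing import Callable

def measure_latency_ms(run_inference: Callable[[], None],
                       warmup: int = 5, iterations: int = 50) -> float:
    """Median wall-clock latency, in milliseconds, of a single inference call."""
    for _ in range(warmup):            # let caches, clocks, and allocators settle
        run_inference()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```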
In operation 320, training of one or more NN models or model variants occurs (e.g., by ML expert 202), as in workflow 212.
In operation 322, the trained models/variants are evaluated for accuracy. Processing returns to operation 302 for those models that meet the accuracy threshold, accompanied by feedback from the evaluation, while others that do not pass may return to operation 320 or may be abandoned. When a given model or variant passes the evaluations in each workflow, it will pass to operation 330. In different embodiments, a trained model's evaluation may examine different metrics in addition to or instead of accuracy, such as training time and configuration (e.g., depth and/or width).
Thus, in the depicted method, hardware-oriented optimization and latency evaluation (operations 310-314) proceed in parallel with training and accuracy evaluation (operations 320-322), with each workflow iterating until a model or variant satisfies both sets of criteria.
After passing the evaluations in operations 314 and 322, in operation 330 an optimized runtime is produced that has satisfied all desired performance metrics, which may have been evaluated in parallel.
A key aspect of the illustrated method is that the feedback from some or all evaluations helps drive the NN architecture selection. Thus, partial results are obtained on the target hardware and then used to determine which models to train. In other words, the workflow is data-driven, based on the current NN model/variant and previous results. In contrast, the traditional approach described above is serial: the architecture is fixed up front, and tuning does not begin until a fully trained model has been delivered.
In contrast, in some embodiments disclosed herein, multiple models and model variants are processed and evaluated concurrently, for example by a system such as system 500 described below.
Within system 500, model architectures (e.g., Yolo, SSD, EfficientDet) and optimization schemes (e.g., quantization, pruning, distillation) are selected in order to produce neural network model variants, by architecture selector 510 and optimizer 512, respectively. Dispatcher 514 selects model variants to process and evaluate, and notifies scheduler 516 accordingly. Scheduler 516 organizes and schedules the respective tasks on GPU cluster 404 (e.g., for training) or device farm 410 (e.g., for performance evaluation). Results of the tasks executed by scheduler 516 are stored in knowledge database (KDB) 520.
In addition to evaluation results (e.g., accuracy, latency, power, memory requirements, bandwidth, batch size, target hardware configuration), KDB 520 may also store model configuration data (e.g., hyperparameters, tuning parameters, layers, executable code for executing a model). The knowledge database may therefore be used for various purposes, including interaction with users (e.g., machine-learning experts and/or engineers) via user interface 522.
In some embodiments, dispatcher 514 automatically proposes, generates, and partially evaluates model variants to populate the KDB with useful data over time. This automation may be achieved by any means of optimizing one or more user-specified target objectives (e.g., model performance, size, speed), such as random search, grid search, Bayesian optimization and genetic algorithms. One example of this automation is guided by a user-specified policy that governs the selection of new variants (e.g., to limit the search space).
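As one illustration of the random-search option, the sketch below samples variant configurations from a hypothetical search space narrowed by a user-specified policy; the dimension names and values are assumptions for illustration only.

```python
import random
from typing import Dict, List, Optional

# Hypothetical search space; a user-specified policy simply narrows it.
SEARCH_SPACE: Dict[str, List] = {
    "architecture": ["yolo", "ssd", "efficientdet"],
    "prune_ratio":  [0.0, 0.25, 0.5],
    "quantization": ["int8", "fp16", "none"],
}

def propose_variants(n: int, policy: Optional[Dict[str, List]] = None, seed: int = 0) -> List[Dict]:
    """Randomly sample `n` candidate variant configurations, restricted to the values
    whitelisted by `policy` for any dimension it mentions."""
    rng = random.Random(seed)
    space = {k: policy.get(k, v) if policy else v for k, v in SEARCH_SPACE.items()}
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]

# Example: limit the search to int8-quantized Yolo-family variants.
candidates = propose_variants(4, policy={"architecture": ["yolo"], "quantization": ["int8"]})
```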
Analytics module 530 draws upon data stored in KDB 520 to visualize evaluation results, display performances of different models/variants, and/or present other data. These data may illustratively assist in the selection and/or design of additional NN configurations. In some embodiments, analytics module 530 is used to predict and infer details that are not yet complete in a candidate neural network's partial evaluation results. For example, in one iteration, a NN model variant may be optimized (quantized and compiled) and evaluated on the device farm for performance (e.g., speed, size, power), with these results being stored in the KDB. The results are only partial because they lack the accuracy component that is generated only after some training. Thus, the analytics module can be used to predict and infer the incomplete results to help a user and the system guide the next workflow iteration. Semantic viewer 532 produces visualizations for users by translating between the data domain of KDB 520, which may feature high dimensionality, and a user domain in which the data may be displayed for human perception.
In some implementations, scheduler 516 maintains a priority list of jobs that may include quantize-compile-evaluate tasks (i.e., sequentially quantize, compile, then evaluate a specified model), train-evaluate tasks (i.e., train and then evaluate a specified model), and/or others. Scheduler 516 therefore keeps the hardware busy running batches of evaluations to gather partial results, and may reprioritize jobs to maintain efficiency of memory loading, caching, and processing of models and evaluation datasets.
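A minimal Python sketch of such a priority list follows; the job kinds and the reprioritization heuristic are illustrative assumptions and not the actual scheduler implementation.

```python
import heapq
import itertools
from typing import Callable

class JobScheduler:
    """Minimal priority-queue sketch of a job list such as the one kept by scheduler 516.

    Lower `priority` values run first; `kind` might be "quantize-compile-evaluate"
    or "train-evaluate", and `job` is any callable that performs the work.
    """
    def __init__(self) -> None:
        self._heap: list = []
        self._counter = itertools.count()   # tie-breaker keeps insertion order stable

    def submit(self, priority: int, kind: str, job: Callable[[], None]) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), kind, job))

    def reprioritize(self, boost_kind: str) -> None:
        # e.g., favor jobs that reuse an already-loaded model or evaluation dataset
        self._heap = [(p - 1 if k == boost_kind else p, c, k, j)
                      for (p, c, k, j) in self._heap]
        heapq.heapify(self._heap)

    def run_next(self) -> None:
        if self._heap:
            _, _, _, job = heapq.heappop(self._heap)
            job()
```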
Thus, as discussed previously, multiple NN model variants are simultaneously in flight at any given time, and may be evaluated to yield complete or partial results. KDB 520 is constantly updated with these results, which are used to help determine the next workflow iteration for training or optimizing an NN model. Also, changes to architectures may be specified in a structured fashion, for instance with a domain-specific language (DSL), to allow existing architectures (including empty or identity architectures) to be changed incrementally and/or recombined in such a way that the KDB captures sufficient data and relations to elicit knowledge. An example of this knowledge may be: “What changes in a model architecture would improve a target objective most across different model families?”
In some embodiments, analytics module 530 employs knowledge graphs, embeddings, and machine-learning tasks. Knowledge graphs are data structures consisting of nodes/entities and relations between them, wherein each node represents the results (partial or complete) of a neural network evaluation.
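By way of illustration, such a graph might be represented as follows; the node identifiers and relation names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class KnowledgeGraph:
    """Sketch of a knowledge graph whose nodes are neural network evaluation results."""
    nodes: Dict[str, dict] = field(default_factory=dict)             # node_id -> evaluation results
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)

    def add_evaluation(self, node_id: str, results: dict) -> None:
        self.nodes[node_id] = results          # partial or complete results

    def relate(self, src: str, relation: str, dst: str) -> None:
        # e.g., ("variant-42", "derived_from", "yolo-base") or ("variant-42", "evaluated_on", "dsp")
        self.edges.append((src, relation, dst))
```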
Embeddings 614, 616, and 618 are lower-dimensional vector representations derived from a knowledge graph, in which evaluations having similar characteristics are positioned near one another.
Based on knowledge graphs and associated embeddings, machine-learning tasks find patterns in the embeddings in order to infer values missing from partial results. Therefore, given a dataset, the system can identify the model variant, optimization, and target hardware most likely to achieve a threshold accuracy while remaining within a desired performance (speed, size, power) envelope. The ML tasks illustratively consist of algorithms (e.g., decision trees) that can be used to infer a projected value. This is useful when certain values are not yet available (e.g., because they have not yet been evaluated on target hardware).
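A minimal sketch of such an inference appears below, assuming scikit-learn's DecisionTreeRegressor as a stand-in for whatever regressor an embodiment actually uses; the feature layout (e.g., latency, size, and power per row) is likewise an assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # assumed available; any regressor would do

def fit_accuracy_predictor(features: np.ndarray, accuracies: np.ndarray) -> DecisionTreeRegressor:
    """Fit a decision tree that maps already-measured metrics (e.g., latency, model
    size, power) of fully evaluated variants to their measured accuracy."""
    model = DecisionTreeRegressor(max_depth=4, random_state=0)
    model.fit(features, accuracies)
    return model

def infer_missing_accuracy(model: DecisionTreeRegressor,
                           partial_features: np.ndarray) -> np.ndarray:
    """Predict accuracy for variants that so far have only partial (hardware) results."""
    return model.predict(partial_features)
```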
Knowledge graphs and embeddings are built up over many iterations of evaluations, using collected performance and accuracy results. As more results are stored (e.g., as clustered partial results), the inferences drawn from the knowledge graphs and embeddings become more reliable.
Because of the high dimensionality of knowledge graph and embeddings data, the data must be projected into a lower-dimensional space for presentation to a human user. Semantic viewer 532 of system 500 performs this projection to produce visualizations such as visualization 700, described next.
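For example, a plain SVD-based principal-component projection could perform this reduction; this is only a sketch and not necessarily the particular projection used by semantic viewer 532.

```python
import numpy as np

def project_to_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional evaluation embeddings onto their two principal
    components for display, using a plain SVD-based PCA."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T        # shape (num_evaluations, 2)
```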
Visualization 700 maps evaluations of neural network models/variants to a two-dimensional space in which the x-axis represents speed (e.g., in number of inferences per second) and the y-axis represents accuracy. Each graphed point represents one evaluation, for either performance (e.g., during a hardware exploration workflow) or accuracy (e.g., during a training workflow). Dotted lines reflect speed and accuracy thresholds sought for an optimized form of a given model or variant, and may be set by a user.
In the illustrated embodiments, the user visualizes results that are captured in batches on target hardware. The color, shape, shading, or other visual characteristic represents the age of the results (previous, recent, and predicted). For example, as shown in visualization 700, results may be separated into previous results 710, recent results 712, and predicted results 714. Previous results 710 are results that were already stored in a knowledge database (or other repository). Recent results 712 are results produced within current and recent workflow iterations, possibly within some discrete timeframe that differs from the timeframe associated with previous results 710. Predicted results 714 are inferred from one or more knowledge graphs, and represent where future results may be situated, based on previous results 710 and recent results 712. An interface may be provided to allow a user to select new experiments (e.g., model architectures and optimization parameters) to explore next, based on this visualization.
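By way of illustration only, and assuming matplotlib is available, a plot in the spirit of visualization 700 might be produced as follows, with the caller supplying the (speed, accuracy) sequences for each result group from the knowledge database.

```python
import matplotlib.pyplot as plt

def plot_results(previous, recent, predicted, speed_threshold, accuracy_threshold):
    """Render a scatter plot of evaluation results in the spirit of visualization 700.

    Each of `previous`, `recent`, and `predicted` is a pair of sequences
    (speeds in inferences/second, accuracies) supplied by the caller.
    """
    plt.scatter(*previous, marker="o", label="previous results")
    plt.scatter(*recent, marker="s", label="recent results")
    plt.scatter(*predicted, marker="^", label="predicted results")
    plt.axvline(speed_threshold, linestyle=":", color="gray")     # target speed
    plt.axhline(accuracy_threshold, linestyle=":", color="gray")  # target accuracy
    plt.xlabel("speed (inferences per second)")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```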
However, other visualizations are also possible. For example, in one embodiment, the user may choose to view models with similar architectures (e.g., Yolo) in terms of accuracy and memory size. This visualization can be useful to determine model capacity, and an annotated overlay may be provided to show number of parameters, model layers, training time, and/or other criteria.
Knowledge database 520 and analytics module 530 of system 500 provide a bridge between the machine-learning and embedded systems domains. Each domain uses a different vocabulary and syntax to describe essentially the same type of processing related to NN inference processing. For example, in the machine-learning domain, inference processing is defined as layered operators operating on tensors (e.g., conv2d refers to a two-dimensional convolution on input tensors and filter weights). In the embedded systems domain, inference processing is defined as multiply-accumulate (MACC) operations for convolution, as related to the hardware instruction set.
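The translation can be made concrete with simple arithmetic: the MACC count of a conv2d layer follows directly from its tensor shapes, as in the sketch below (the example numbers are illustrative only).

```python
def conv2d_maccs(out_h: int, out_w: int, out_ch: int,
                 in_ch: int, k_h: int, k_w: int) -> int:
    """Multiply-accumulate count for one conv2d layer: each of the
    out_h * out_w * out_ch output values requires k_h * k_w * in_ch MACCs."""
    return out_h * out_w * out_ch * in_ch * k_h * k_w

# e.g., a 3x3 convolution producing a 112x112x64 output from 32 input channels:
# 112 * 112 * 64 * 32 * 3 * 3 = 231,211,008 MACCs
```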
Thus, analogous to language translation, the vocabularies of the ML and embedded system domains are bridged, and evaluation results related to the domains can be presented through semantic viewer 532 using metrics from the ML domain (e.g., accuracy) or the embedded system domain (e.g., MACC counts).
Given the ability of the KDB and analytics module to map disparate domains into a single construct, it is understood that the disclosed embodiments can be used for other purposes beyond NN model optimization within an efficient end-to-end MLOps workflow, particularly in realms where only partial data are available. For example, in cybersecurity and fraud detection, information such as phone calls and email messages can be mapped to physical activities and locations in order to identify data that is suspicious or malicious.
After a model or model variant has passed the applicable evaluations, it may be packaged for deployment on the target hardware. In some embodiments, packaging module 814 links a compiled model with selected services from runtime services library 816 to produce optimized application 820.
Resulting optimized application 820 includes runtime engine 822, which comprises optimized inference runtime (OIR) 824 and the selected runtime services 826. OIR 824, which may be generated by a compiler, provides the basic functionality for running the NN model's inference computations (e.g., convolutions, activations, and/or other mathematical functions).
In different embodiments, runtime services library 816 provides different runtime services for linking with an OIR, but in some embodiments the services can be grouped into four categories: operational, security, query, and error handling. Operational runtime services include (a) inference functions such as loading and unloading values into and out of memory, user-defined operations for pre- and/or post-processing (e.g., resizing or scaling input), and running a neural network inference, and (b) continual learning functions such as getting or setting a confidence score threshold and storing input data for later training (e.g., if results are above the confidence score threshold). With reference to system 500, these operational services may be invoked, for example, when scheduler 516 executes inference or training tasks on device farm 410 or GPU cluster 404.
Security runtime services include functions for validating the integrity of a model (e.g., with a CRC (Cyclic Redundancy Check) algorithm), decrypting and authenticating a watermark within a model, and validating authorization for enabling operation of a runtime engine. It should be noted that the parameters for NN models and variants can be stored in a shared library (e.g., in system 500), and accessed by dispatcher 514 when selected. Because measurements of model accuracy are statistical in nature, it can be important to verify that the model has not been altered or compromised. An integrity check can also be useful to check for cyber-attacks, wherein model parameters are altered to change the behavior of the model.
In some embodiments, packaging module 814 inserts a watermark signature within a model's parameters in a way that does not affect the model's operational performance. For example, if a compressed NN model is quantized to 7-bit precision, then the packaging module can insert one bit of watermark signature into each model parameter. For a model with 1 million parameters, the watermark signature can therefore store at most 1 million bits. The watermark signature may also be selected based on the latency of authenticating the watermark on the target hardware.
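As a hedged sketch of one possible realization (the exact bit placement is not specified here), the watermark bits could occupy the bit left unused by 7-bit quantization, assumed below to be the least-significant bit of an 8-bit container.

```python
import numpy as np

def embed_watermark(quantized_params: np.ndarray, watermark_bits: np.ndarray) -> np.ndarray:
    """Hide one watermark bit in the least-significant bit of each 8-bit parameter.

    Assumes the model values occupy the upper 7 bits (per the 7-bit-precision example
    above), so overwriting the LSB does not change model behavior.
    """
    params = quantized_params.astype(np.uint8)
    n = min(params.size, watermark_bits.size)
    flat = params.reshape(-1)
    flat[:n] = (flat[:n] & 0xFE) | (watermark_bits[:n].astype(np.uint8) & 1)
    return params

def extract_watermark(params: np.ndarray, length: int) -> np.ndarray:
    """Recover the first `length` watermark bits from the parameters' LSBs."""
    return params.reshape(-1)[:length].astype(np.uint8) & 1
```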
Query runtime services deal with model information, and may include functions for collecting runtime engine health information and/or other model metadata, collecting performance metrics (e.g., statistics related to memory, latency, accuracy), and obtaining diagnostic information related to debugging and/or profiling operation of a runtime engine. Scheduler 516 of system 500 can therefore query runtime services to obtain performance metrics to support NN model evaluation. Also, to support multiple variants, dispatcher 514 and scheduler 516 can use the query runtime services to scan a model's metadata, including its origin and UUID (universally unique identifier of the variant), and/or to collect performance metrics such as the time taken to complete an NN inference (i.e., the latency).
Error handling runtime services deal with error conditions and include functions for handling interrupts and/or error/timeout codes, such as by diagnosing related error conditions. Scheduler 516 may use an error handling runtime service if an exception or timeout occurred during operation or evaluation of a neural network model or model variant.
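By way of illustration, the four service categories might be exposed to a runtime engine through an interface along the following lines; the method names are hypothetical and not part of any particular embodiment.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class RuntimeServices(ABC):
    """Abstract sketch of the four runtime-service categories linked with an OIR."""

    # Operational
    @abstractmethod
    def run_inference(self, inputs: Any) -> Any: ...
    @abstractmethod
    def set_confidence_threshold(self, threshold: float) -> None: ...

    # Security
    @abstractmethod
    def verify_integrity(self) -> bool: ...          # e.g., CRC over model parameters
    @abstractmethod
    def authenticate_watermark(self) -> bool: ...

    # Query
    @abstractmethod
    def get_metadata(self) -> Dict[str, Any]: ...    # e.g., model origin, variant UUID
    @abstractmethod
    def get_metrics(self) -> Dict[str, float]: ...   # e.g., latency, memory statistics

    # Error handling
    @abstractmethod
    def last_error(self) -> Dict[str, Any]: ...      # e.g., timeout or interrupt codes
```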
An environment in which one or more embodiments described above are executed may incorporate a general-purpose computer or a special-purpose device such as a hand-held computer or communication device. Some details of such devices (e.g., processor, memory, data storage, display) may be omitted for the sake of clarity. A component such as a processor or memory to which one or more tasks or functions are attributed may be a general component temporarily configured to perform the specified task or function, or may be a specific component manufactured to perform the task or function. The term “processor” as used herein refers to one or more electronic circuits, devices, chips, processing cores and/or other components configured to process data and/or computer program code.
Data structures and program code described in this detailed description are typically stored on a non-transitory computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. Non-transitory computer-readable storage media include, but are not limited to, volatile memory; non-volatile memory; electrical, magnetic, and optical storage devices such as disk drives, magnetic tape, CDs (compact discs) and DVDs (digital versatile discs or digital video discs), solid-state drives, and/or other non-transitory computer-readable media now known or later developed.
Methods and processes described in the detailed description can be embodied as code and/or data, which may be stored in a non-transitory computer-readable storage medium as described above. When a processor or computer system reads and executes the code and manipulates the data stored on the medium, the processor or computer system performs the methods and processes embodied as code and data structures and stored within the medium.
Furthermore, the methods and processes may be programmed into hardware modules such as, but not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), and other programmable-logic devices now known or hereafter developed. When such a hardware module is activated, it performs the methods and processes included within the module.
The foregoing embodiments have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit this disclosure to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. The scope is defined by the appended claims, not the preceding disclosure.
This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application Ser. No. 63/320,941, which was filed on Mar. 17, 2022 and is hereby incorporated by reference.