This disclosure relates generally to Edge networking and, more particularly, to methods, systems, articles of manufacture and apparatus to optimize resources in edge networks.
In recent years, entities that service workload requests are chartered with the responsibility of distributing those workloads in a manner that satisfies client demands. In some environments, the underlying platform resources are known ahead of time. Edge computing resources are being utilized to a greater extent and the target computational devices are heterogeneous. Target computational devices may include CPUs, GPUs, FPGAs and/or other types of accelerators.
FIG. ID3_1 is a block diagram of an example environment including an example orchestrator and a plurality of multi-core computing nodes executing artificial intelligence (AI) models.
FIG. ID3_3 is a block diagram of the example orchestrator of FIG. ID3_1.
FIG. ID3_4 is a block diagram of an example multi-core computing node executing AI models and utilizing a controller.
FIG. ID3_5 is an example performance map of candidate models with varying cache size and memory bandwidth utilization.
FIG. ID3_6 is a block diagram of an example system flow for the multi-core computing node of FIG. ID3_1.
FIGS. ID3_7, ID3_8, and ID3_9 are flowcharts representative of example machine readable instructions that may be executed by example processor circuitry to implement the orchestrator of FIG. ID3_1.
FIG. ID4_A illustrates an example automated machine learning apparatus.
FIG. ID4_B illustrates an implementation of the example featurization search system of FIG. ID4_A.
FIG. ID4_C is an implementation of the example hardware platform of FIG. ID4_A on which the example featurization search system of FIG. ID4_A can operate and/or be implemented.
FIG. ID4_D illustrates an example crowd-sourced deployment of a plurality of hardware platforms gathering data at a server.
FIGS. ID4_E-ID4_G are flowcharts representative of machine readable instructions which may be executed to implement all or part of the example featurization search system of FIGS. ID4_A-ID4_D.
FIG. ID5_A is an example framework to optimize a workload.
FIG. ID5_B is an example graph semantic embedding for a candidate graph of interest.
FIG. ID5_C is an example optimizer to optimize workloads in a heterogeneous environment.
FIGS. ID5_D through ID5_F are flowcharts representative of machine-readable instructions which may be executed to implement all or part of the example optimizer of FIG. ID5_C.
FIG. ID6_1A is an example framework of structure and machine-readable instructions which may be executed to implement a search flow for a model compression method.
FIG. ID6_1B illustrates example compression techniques.
FIG. ID6_1C is a block diagram of an example compression framework.
FIG. ID6_2 is a block diagram of an example framework of structure and machine-readable instructions which may be executed to implement a search flow for a model compression method for three customers with different requirements.
FIG. ID6_3 is an example of three customers with different requirements.
FIG. ID6_4 illustrates example productivity improvements resulting from examples disclosed herein.
FIG. ID6_5 is an example of a generalized agent architecture for the scalable model compression method for optimal platform specialization.
FIG. ID6_6 is an example of transfer learning by reusing data of customers with different requirements to create a scalable model compression method for optimal platform specialization.
FIG. ID6_7 is an example of knowledge accumulation via central database.
FIG. ID7_1 is an example schematic illustration of a framework for generating and providing optimal quantization weights to deep learning models.
FIG. ID7_2 is an example overview of a framework of methods for optimizing quantization efforts.
FIG. ID7_3 is a flowchart representative of machine-readable instructions which may be executed to implement an apparatus and method for generating and providing optimal quantization weights to deep learning models.
FIG. ID7_4 is an example of a sample action space.
FIG. ID7_5 is an example of a pre-quantized model and quantized model.
FIG. ID7_6 is a flowchart representative of an example method of providing quantization weights to deep learning models.
FIG. ID11_A is a block diagram of an example of a branch path between layers in a neural network according to an example.
FIG. ID11_B is a block diagram of an example of a branch path that excludes channels from consideration based on a pruning ratio constraint according to an example.
FIG. ID11_C is a block diagram of an example of a branch path that balances an accuracy constraint against a layer width loss according to an example.
FIG. ID11_D is a flowchart of an example of a method of conducting pruning operations in a semiconductor apparatus according to an example.
FIG. ID11_E is a flowchart of an example of a method of conducting an importance classification of context information according to an example.
FIG. ID11_F is a block diagram of an example of a computer vision system according to an example.
FIG. ID11_G is an illustration of an example of a semiconductor package apparatus according to an example.
FIG. ID11_H is a block diagram of an example of a processor according to an example.
FIG. ID14_1 is a schematic illustration of an example convolution operation using a static kernel.
FIG. ID14_2A is a schematic illustration of an example convolution operation using a dynamic kernel.
FIG. ID14_2B illustrates an example input image and a plurality of different filters.
FIG. ID14_3 is a block diagram of example machine learning trainer circuitry implemented in accordance with teachings of this disclosure for training an adaptive convolutional neural network (CNN).
FIG. ID14_4 is a flowchart representative of example machine-readable instructions which may be executed to implement the example machine learning trainer circuitry to train a CNN using dynamic kernels.
FIG. ID15_1 includes graphs illustrating iterations of the example process for finding roots of a function.
FIG. ID15_2 represents values of seven iterations of the root finding process associated with the polynomial function of FIG. ID15_1.
FIG. ID15_3 is a block diagram of an example neural network.
FIG. ID15_4 is a block diagram of an example machine learning trainer 400 implemented in accordance with teachings of this disclosure to perform second order bifurcating for training of a machine learning model.
FIG. ID15_5 is a flowchart representing example machine-readable instructions that may be executed to train a machine learning model using the roots identified by the root finder of FIG. ID15_2.
FIG. ID15_6 is a flowchart representing example machine-readable instructions that may be executed to cause the example root finder and/or root tester to find roots of a function.
FIG. ID15_7 is a flowchart representing example machine-readable instructions that may be executed to cause the example root finder to find roots of a function.
FIG. ID15_8 is a flowchart representing example mathematical operations corresponding to the machine-readable instructions of FIG. ID15_7.
FIG. ID15_10 represents an experiment to train a neural network to generate a particular output.
FIG. ID24_A1 is a block diagram of an example optimizer to generate neural network (NN) dictionaries.
FIG. ID24_A2 is a block diagram of two example tuning methodologies.
FIG. ID24_B1 is a block diagram of example dictionary key replacement.
FIG. ID24_B2 is a block diagram of example matrix progressions to generate a NN dictionary.
FIG. ID24_C illustrates an example compression process.
FIG. ID24_D illustrates an example linear memory and indexing method for on-device dictionaries.
FIG. ID24_E illustrates example matrix/tensor decompression during inference time.
FIGS. ID24_F through ID24_I are flowcharts representing example machine-readable instructions that may be executed to generate dictionaries.
The figures are not to scale. Instead, the thickness of the layers or regions may be enlarged in the drawings. Although the figures show layers and regions with clean lines and boundaries, some or all of these lines and/or boundaries may be idealized. In reality, the boundaries and/or lines may be unobservable, blended, and/or irregular. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts. As used herein, unless otherwise stated, the term “above” describes the relationship of two parts relative to Earth. A first part is above a second part, if the second part has at least one part between Earth and the first part. Likewise, as used herein, a first part is “below” a second part when the first part is closer to the Earth than the second part. As noted above, a first part can be above or below a second part with one or more of: other parts therebetween, without other parts therebetween, with the first and second parts touching, or without the first and second parts being in direct contact with one another. As used in this patent, stating that any part (e.g., a layer, film, area, region, or plate) is in any way on (e.g., positioned on, located on, disposed on, or formed on, etc.) another part, indicates that the referenced part is either in contact with the other part, or that the referenced part is above the other part with one or more intermediate part(s) located therebetween. As used herein, connection references (e.g., attached, coupled, connected, and joined) may include intermediate members between the elements referenced by the connection reference and/or relative movement between those elements unless otherwise indicated. As such, connection references do not necessarily infer that two elements are directly connected and/or in fixed relation to each other. 
As used herein, stating that any part is in “contact” with another part is defined to mean that there is no intermediate part between the two parts.
Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc., are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name. As used herein, “approximately” and “about” refer to dimensions that may not be exact due to manufacturing tolerances and/or other real world imperfections. As used herein, “substantially real time” refers to occurrence in a near instantaneous manner recognizing there may be real world delays for computing time, transmission, etc. Thus, unless otherwise specified, “substantially real time” refers to real time +/− 1 second. As used herein, the phrase “in communication,” including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.
As used herein, “processor circuitry” is defined to include (i) one or more special purpose electrical circuits structured to perform specific operation(s) and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors), and/or (ii) one or more general purpose semiconductor-based electrical circuits programmed with instructions to perform specific operations and including one or more semiconductor-based logic devices (e.g., electrical hardware implemented by one or more transistors). Examples of processor circuitry include programmed microprocessors, Field Programmable Gate Arrays (FPGAs) that may instantiate instructions, Central Processor Units (CPUs), Graphics Processor Units (GPUs), Digital Signal Processors (DSPs), XPUs, or microcontrollers and integrated circuits such as Application Specific Integrated Circuits (ASICs). For example, an XPU may be implemented by a heterogeneous computing system including multiple types of processor circuitry (e.g., one or more FPGAs, one or more CPUs, one or more GPUs, one or more DSPs, etc., and/or a combination thereof) and application programming interface(s) (API(s)) that may assign computing task(s) to whichever one(s) of the multiple types of the processing circuitry is/are best suited to execute the computing task(s).
Artificial intelligence (AI), including machine learning (ML), deep learning (DL), and/or other artificial machine-driven logic, enables machines (e.g., computers, logic circuits, etc.) to use a model to process input data to generate an output based on patterns and/or associations previously learned by the model via a training process. As used herein, data is information in any form that may be ingested, processed, interpreted and/or otherwise manipulated by processor circuitry to produce a result. The produced result may itself be data. As used herein, a model is a set of instructions and/or data that may be ingested, processed, interpreted and/or otherwise manipulated by processor circuitry to produce a result. Often, a model is operated using input data to produce output data in accordance with one or more relationships reflected in the model. The model may be based on training data. For instance, the model may be trained with data to recognize patterns and/or associations and follow such patterns and/or associations when processing input data such that other input(s) result in output(s) consistent with the recognized patterns and/or associations.
Many different types of machine learning models and/or machine learning architectures exist. In some non-limiting examples disclosed herein, a Neural Network (NN) model is used but examples disclosed herein are not limited thereto.
In general, implementing a ML/AI system involves at least two phases, a learning/training phase and an inference phase. In the learning/training phase, a training algorithm is used to train a model to operate in accordance with patterns and/or associations based on, for example, training data. In general, the model includes internal parameters that guide how input data is transformed into output data, such as through a series of nodes and connections within the model to transform input data into output data. In some examples, parameters are synonymous with metrics. In some examples, metrics include latency values (e.g., a duration of time in milliseconds), accuracy values (e.g., a percentage difference between a calculated value and a ground truth value (e.g., for a model)), etc. Additionally, hyperparameters are used as part of the training process to control how the learning is performed (e.g., a learning rate, a number of layers to be used in the machine learning model, etc.). Hyperparameters are defined to be training parameters that are determined prior to initiating the training process.
Different types of training may be performed based on the type of ML/AI model and/or the expected output. For example, supervised training uses inputs and corresponding expected (e.g., labeled) outputs to select parameters (e.g., by iterating over combinations of select parameters) for the ML/AI model that reduce model error. As used herein, labelling refers to an expected output of the machine learning model (e.g., a classification, an expected output value, etc.). Alternatively, unsupervised training (e.g., used in deep learning, a subset of machine learning, etc.) involves inferring patterns from inputs to select parameters for the ML/AI model (e.g., without the benefit of expected (e.g., labeled) outputs). Unsupervised training may be particularly helpful in circumstances where patterns in the data are not known beforehand. Unsupervised learning can be helpful to identify details that can further lead to data characterization (e.g., identifying subtle patterns in the data).
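As a non-limiting illustration of supervised training that selects a parameter by iterating over candidate values, the following minimal Python sketch may be considered. The one-parameter model, the labeled data, the candidate list, and the function names are all assumptions introduced here for illustration and are not part of the disclosed examples.

```python
# Hypothetical labeled training data: each input has an expected output (y = 2x).
inputs = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]


def mean_squared_error(weight, xs, ys):
    """Model error of the assumed one-parameter model y = weight * x."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_parameter(candidates, xs, ys):
    """Iterate over candidate parameter values and keep the lowest-error one."""
    return min(candidates, key=lambda w: mean_squared_error(w, xs, ys))


# Assumed candidate values; a real search space would be model specific.
candidate_weights = [0.5, 1.0, 1.5, 2.0, 2.5]
best_weight = select_parameter(candidate_weights, inputs, labels)
print(best_weight)
```

In this sketch the labels supply the expected outputs, and the selected parameter is simply the candidate with the smallest error against those labels.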
In examples disclosed herein, ML/AI models are trained using any training algorithm. In examples disclosed herein, training is performed until convergence and/or a threshold error metric is measured. As used herein “threshold” is expressed as data such as a numerical value represented in any form, that may be used by processor circuitry as a reference for a comparison operation. In examples disclosed herein, training is performed at any location within the Edge network that has, for example, adequate processing capabilities, adequate power, and/or adequate memory. Training is performed using hyperparameters that control how the learning is performed (e.g., a learning rate, a number of layers to be used in the machine learning model, etc.). In some examples re-training may be performed. Such re-training may be performed in response to feedback metrics and/or error metrics.
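By way of a hedged illustration only, training until a threshold error metric is measured may be sketched as follows. Gradient descent on a one-parameter linear model is an assumption chosen for brevity (the examples disclosed herein permit any training algorithm), and the names `train`, `learning_rate`, `error_threshold`, and `max_epochs` are hypothetical.

```python
def train(xs, ys, learning_rate=0.01, error_threshold=1e-6, max_epochs=10_000):
    """Fit y = weight * x by gradient descent, stopping at a threshold error.

    learning_rate and max_epochs are hyperparameters fixed before training;
    weight is the internal model parameter adjusted during training.
    """
    weight = 0.0
    error = float("inf")
    for epoch in range(1, max_epochs + 1):
        # Gradient of the mean squared error with respect to weight.
        grad = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        weight -= learning_rate * grad
        error = sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        # The threshold is the reference value for the comparison operation.
        if error <= error_threshold:
            return weight, error, epoch
    return weight, error, max_epochs


# Hypothetical labeled data generated from y = 2x.
weight, error, epochs = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(weight, error, epochs)
```

The epoch cap is a design choice of this sketch: it guarantees termination if the threshold is never reached, in which case re-training with different hyperparameters may be appropriate.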
Training is performed using training data. In examples disclosed herein, the training data originates from any source, such as historical data from previously executed processes. In some examples, because supervised training is used, the training data is labeled.
Once training is complete, the model is deployed for use as an executable construct that processes an input and provides an output based on the network of nodes and connections defined in the model. The model is stored at one or more locations of the Edge network, including servers, platforms and/or IoT devices. The model may then be executed by the Edge devices.
Once trained, the deployed model may be operated in an inference phase to process data. In the inference phase, data to be analyzed (e.g., live data) is input to the model, and the model executes to create an output. This inference phase can be thought of as the AI “thinking” to generate the output based on what it learned from the training (e.g., by executing the model to apply the learned patterns and/or associations to the live data). In some examples, input data undergoes pre-processing before being used as an input to the machine learning model. Moreover, in some examples, the output data may undergo post-processing after it is generated by the AI model to transform the output into a useful result (e.g., a display of data, an instruction to be executed by a machine, etc.).
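The inference phase with optional pre-processing and post-processing may be pictured with the following minimal Python sketch. The temperature scenario, the fixed linear stand-in for a trained model, and every function name are assumptions made for illustration rather than features of any particular deployed model.

```python
def preprocess(raw_reading_celsius):
    """Pre-processing: scale a raw reading into the range the model expects."""
    return raw_reading_celsius / 100.0


def model(feature, weight=2.0, bias=-1.0):
    """Stand-in for a trained model: a fixed linear score (assumed values)."""
    return weight * feature + bias


def postprocess(score):
    """Post-processing: turn the raw score into a useful result."""
    return "ALERT" if score > 0.0 else "OK"


def infer(raw_reading_celsius):
    """Run live data through pre-processing, the model, and post-processing."""
    return postprocess(model(preprocess(raw_reading_celsius)))


print(infer(80.0))
print(infer(20.0))
```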
In some examples, output of the deployed model may be captured and provided as feedback. By analyzing the feedback, an accuracy of the deployed model can be determined. If the feedback indicates that the accuracy of the deployed model is less than a threshold or other criterion, training of an updated model can be triggered using the feedback and an updated training data set, hyperparameters, etc., to generate an updated, deployed model.
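The feedback check described above may be sketched, under stated assumptions, as a comparison of measured accuracy against a threshold. The 0.9 default threshold, the exact-match accuracy measure, and the function names are hypothetical choices for this illustration.

```python
def accuracy(predictions, ground_truth):
    """Fraction of captured model outputs that match the ground truth."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def needs_retraining(predictions, ground_truth, accuracy_threshold=0.9):
    """Return True when feedback shows accuracy below the threshold."""
    return accuracy(predictions, ground_truth) < accuracy_threshold


# Hypothetical feedback: three of four deployed-model outputs were correct.
feedback_predictions = [1, 0, 1, 1]
feedback_truth = [1, 0, 0, 1]
print(needs_retraining(feedback_predictions, feedback_truth))
```

When the check returns True, training of an updated model could be triggered using the feedback and an updated training data set.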
Compute, memory, and storage are scarce resources, and generally decrease depending on the Edge location (e.g., fewer processing resources being available at consumer endpoint devices, than at a base station, than at a central office). However, the closer that the Edge location is to the endpoint (e.g., user equipment (UE)), the more that space and power are often constrained. Thus, Edge computing attempts to reduce the amount of resources needed for network services, through the distribution of more resources which are located closer both geographically and in network access time. In this manner, Edge computing attempts to bring the compute resources to workload data where appropriate, or, bring the workload data to the compute resources. In some examples, a workload includes, but is not limited to, executable processes, such as algorithms, machine learning algorithms, image recognition algorithms, gain/loss algorithms, etc.
The following describes aspects of an Edge cloud architecture that covers multiple potential deployments and addresses restrictions that some network operators or service providers may have in their own infrastructures. These include: variation of configurations based on the Edge location (because edges at a base station level, for instance, may have more constrained performance and capabilities in a multi-tenant scenario); configurations based on the type of compute, memory, storage, fabric, acceleration, or like resources available to Edge locations, tiers of locations, or groups of locations; the service, security, and management and orchestration capabilities; and related objectives to achieve usability and performance of end services. These deployments may accomplish processing in network layers that may be considered as “near Edge”, “close Edge”, “local Edge”, “middle Edge”, or “far Edge” layers, depending on latency, distance, and timing characteristics.
Edge computing is a developing paradigm where computing is performed at or closer to the “Edge” of a network, typically through the use of a compute platform (e.g., x86 or ARM compute hardware architecture) implemented at base stations, gateways, network routers, or other devices which are much closer to endpoint devices producing and consuming the data. For example, Edge gateway servers may be equipped with pools of memory and storage resources to perform computation in real-time for low latency use-cases (e.g., autonomous driving or video surveillance) for connected client devices. Or as an example, base stations may be augmented with compute and acceleration resources to directly process service workloads for connected user equipment, without further communicating data via backhaul networks. Or as another example, central office network management hardware may be replaced with standardized compute hardware that performs virtualized network functions and offers compute resources for the execution of services and consumer functions for connected devices. Within Edge computing networks, there may be scenarios in services in which the compute resource will be “moved” to the data, as well as scenarios in which the data will be “moved” to the compute resource. Or as an example, base station compute, acceleration and network resources can provide services in order to scale to workload demands on an as-needed basis by activating dormant capacity (subscription, capacity on demand) in order to manage corner cases, emergencies or to provide longevity for deployed resources over a significantly longer implemented lifecycle.
Examples of latency, resulting from network communication distance and processing time constraints, may range from less than a millisecond (ms) when among the endpoint layer A200, under 5 ms at the Edge devices layer A210, to between 10 and 40 ms when communicating with nodes at the network access layer A220. Beyond the Edge cloud A110 are core network A230 and cloud data center A240 layers, each with increasing latency (e.g., between 50 and 60 ms at the core network layer A230, to 100 or more ms at the cloud data center layer). As a result, operations at a core network data center A235 or a cloud data center A245, with latencies of at least 50 to 100 ms or more, will not be able to accomplish many time-critical functions of the use cases A205. Each of these latency values is provided for purposes of illustration and contrast; it will be understood that the use of other access network mediums and technologies may further reduce the latencies. In some examples, respective portions of the network may be categorized as “close Edge”, “local Edge”, “near Edge”, “middle Edge”, or “far Edge” layers, relative to a network source and destination. For instance, from the perspective of the core network data center A235 or a cloud data center A245, a central office or content data network may be considered as being located within a “near Edge” layer (“near” to the cloud, having high latency values when communicating with the devices and endpoints of the use cases A205), whereas an access point, base station, on-premise server, or network gateway may be considered as located within a “far Edge” layer (“far” from the cloud, having low latency values when communicating with the devices and endpoints of the use cases A205).
It will be understood that other categorizations of a particular network layer as constituting a “close”, “local”, “near”, “middle”, or “far” Edge may be based on latency, distance, number of network hops, or other measurable characteristics, as measured from a source in any of the network layers A200-A240.
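To make the layer latencies discussed above concrete, the following Python sketch lists which network layers could satisfy a given latency budget. The per-layer figures are the illustrative values stated above, with the upper bound used where a range is given; for the cloud data center layer only a floor of 100 ms is stated, so 100 ms is used here as an optimistic assumption. The table and function names are hypothetical.

```python
# Illustrative latency per layer in milliseconds (assumed bounds, see lead-in).
ILLUSTRATIVE_LATENCY_MS = [
    ("endpoint", 1.0),             # less than a millisecond
    ("edge devices", 5.0),         # under 5 ms
    ("network access", 40.0),      # between 10 and 40 ms
    ("core network", 60.0),        # between 50 and 60 ms
    ("cloud data center", 100.0),  # 100 ms or more (floor, optimistic)
]


def layers_within_budget(budget_ms):
    """Return the layers whose illustrative latency fits the budget."""
    return [name for name, latency in ILLUSTRATIVE_LATENCY_MS
            if latency <= budget_ms]


# A time-critical use case with a hypothetical 10 ms budget.
print(layers_within_budget(10.0))
```

Under these assumed figures, a 10 ms budget excludes the network access, core network, and cloud data center layers, which is consistent with the observation that time-critical functions may not be served from a core network or cloud data center.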
The various use cases A205 may access resources under usage pressure from incoming streams, due to multiple services utilizing the Edge cloud. To achieve results with low latency, the services executed within the Edge cloud A110 balance varying requirements in terms of: (a) Priority (throughput or latency) and Quality of Service (QoS) (e.g., traffic for an autonomous car may have higher priority than a temperature sensor in terms of response time requirement; or, a performance sensitivity/bottleneck may exist at a compute/accelerator, memory, storage, or network resource, depending on the application); (b) Reliability and Resiliency (e.g., some input streams need to be acted upon and the traffic routed with mission-critical reliability, whereas some other input streams may tolerate an occasional failure, depending on the application); and (c) Physical constraints (e.g., power, cooling and form-factor).
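One simple way to picture priority-based handling of incoming streams is a priority queue, sketched below in Python. This is an illustrative assumption rather than the disclosed balancing mechanism; the class name, the numeric priority values (lower meaning more urgent), and the request strings are hypothetical.

```python
import heapq
import itertools


class PriorityDispatcher:
    """Serve requests in priority order; lower number means higher priority."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal priorities stay first-in, first-out.
        self._counter = itertools.count()

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self):
        """Remove and return the most urgent pending request."""
        return heapq.heappop(self._heap)[2]


dispatcher = PriorityDispatcher()
dispatcher.submit(5, "temperature sensor reading")
dispatcher.submit(1, "autonomous car traffic")
dispatcher.submit(5, "second temperature sensor reading")
order = [dispatcher.next_request() for _ in range(3)]
print(order)
```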
The end-to-end service view for these use cases involves the concept of a service-flow and is associated with a transaction. The transaction details the overall service requirement for the entity consuming the service, as well as the associated services for the resources, workloads, workflows, and business functional and business level requirements. The services executed with the “terms” described may be managed at each layer in a way to assure real-time and runtime contractual compliance for the transaction during the lifecycle of the service. When a component in the transaction is missing its agreed-to service level agreement (SLA), the system as a whole (components in the transaction) may provide the ability to (1) understand the impact of the SLA violation, (2) augment other components in the system to resume overall transaction SLA, and (3) implement steps to remediate. In some examples, an SLA is an agreement, commitment and/or contract between entities. The SLA may include parameters (e.g., latency) and corresponding values (e.g., time in milliseconds) that must be satisfied for the SLA to be deemed in compliance.
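A compliance check of measured values against SLA parameters may be sketched as follows. The sketch assumes that every SLA value is an upper limit and that a missing measurement counts as a violation; both are simplifying assumptions, and the parameter names and numbers are hypothetical.

```python
def sla_violations(sla_limits, measured):
    """Return the SLA parameters whose measured value exceeds its limit.

    A parameter with no measurement is treated as a violation (assumption).
    """
    return sorted(
        name
        for name, limit in sla_limits.items()
        if measured.get(name, float("inf")) > limit
    )


def in_compliance(sla_limits, measured):
    """An SLA is deemed in compliance when no parameter is violated."""
    return not sla_violations(sla_limits, measured)


# Hypothetical SLA with two parameters and one set of measurements.
sla = {"latency_ms": 5.0, "jitter_ms": 1.0}
observed = {"latency_ms": 7.2, "jitter_ms": 0.4}
print(sla_violations(sla, observed))
```

Returning the list of violated parameters, rather than a single Boolean, is a design choice of this sketch that supports understanding the impact of a violation before remediation steps are taken.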
Thus, with these variations and service features in mind, Edge computing within the Edge cloud A110 may provide the ability to serve and respond to multiple applications of the use cases A205 (e.g., object tracking, video surveillance, connected cars, etc.) in real-time or near real-time, and meet ultra-low latency requirements for these multiple applications. These advantages enable a whole new class of applications (Virtual Network Functions (VNFs), Function as a Service (FaaS), Edge as a Service (EaaS), standard processes, etc.), which cannot leverage conventional cloud computing due to latency or other limitations.
However, with the advantages of Edge computing come the following caveats. The devices located at the Edge are often resource constrained and therefore there is pressure on usage of Edge resources. Typically, this is addressed through the pooling of memory and storage resources for use by multiple users (tenants) and devices. The Edge may be power and cooling constrained and therefore the power usage needs to be accounted for by the applications that are consuming the most power. There may be inherent power-performance tradeoffs in these pooled memory resources, as many of them are likely to use emerging memory technologies, where more power requires greater memory bandwidth. Likewise, improved security of hardware and root of trust trusted functions are also required, because Edge locations may be unmanned and may even need permissioned access (e.g., when housed in a third-party location). Such issues are magnified in the Edge cloud A110 in a multi-tenant, multi-owner, or multi-access setting, where services and applications are requested by many users, especially as network usage dynamically fluctuates and the composition of the multiple stakeholders, use cases, and services changes.
At a more generic level, an Edge computing system may be described to encompass any number of deployments at the previously discussed layers operating in the Edge cloud A110 (network layers A200-A240), which provide coordination from client and distributed computing devices. One or more Edge gateway nodes, one or more Edge aggregation nodes, and one or more core data centers may be distributed across layers of the network to provide an implementation of the Edge computing system by or on behalf of a telecommunication service provider (“telco”, or “TSP”), internet-of-things service provider, cloud service provider (CSP), enterprise entity, or any other number of entities. Various implementations and configurations of the Edge computing system may be provided dynamically, such as when orchestrated to meet service objectives.
Consistent with the examples provided herein, a client compute node may be embodied as any type of endpoint component, device, appliance, or other thing capable of communicating as a producer or consumer of data. Further, the label “node” or “device” as used in the Edge computing system does not necessarily mean that such node or device operates in a client or agent/minion/follower role; rather, any of the nodes or devices in the Edge computing system refer to individual entities, nodes, or subsystems which include discrete or connected hardware or software configurations to facilitate or use the Edge cloud A110.
As such, the Edge cloud A110 is formed from network components and functional features operated by and within Edge gateway nodes, Edge aggregation nodes, or other Edge compute nodes among network layers A210-A230. The Edge cloud A110 thus may be embodied as any type of network that provides Edge computing and/or storage resources which are proximately located to radio access network (RAN) capable endpoint devices (e.g., mobile computing devices, IoT devices, smart devices, etc.), which are discussed herein. In other words, the Edge cloud A110 may be envisioned as an “Edge” which connects the endpoint devices and traditional network access points that serve as an ingress point into service provider core networks, including mobile carrier networks (e.g., Global System for Mobile Communications (GSM) networks, Long-Term Evolution (LTE) networks, 5G/6G networks, etc.), while also providing storage and/or compute capabilities. Other types and forms of network access (e.g., Wi-Fi, long-range wireless, wired networks including optical networks) may also be utilized in place of or in combination with such 3GPP carrier networks.
The network components of the Edge cloud A110 may be servers, multi-tenant servers, appliance computing devices, and/or any other type of computing devices. For example, the Edge cloud A110 may include an appliance computing device that is a self-contained electronic device including a housing, a chassis, a case or a shell. In some circumstances, the housing may be dimensioned for portability such that it can be carried by a human and/or shipped. Example housings may include materials that form one or more exterior surfaces that partially or fully protect contents of the appliance, in which protection may include weather protection, hazardous environment protection (e.g., EMI, vibration, extreme temperatures), and/or enable submergibility. Example housings may include power circuitry to provide power for stationary and/or portable implementations, such as AC power inputs, DC power inputs, AC/DC or DC/AC converter(s), power regulators, transformers, charging circuitry, batteries, wired inputs and/or wireless power inputs. Example housings and/or surfaces thereof may include or connect to mounting hardware to enable attachment to structures such as buildings, telecommunication structures (e.g., poles, antenna structures, etc.) and/or racks (e.g., server racks, blade mounts, etc.). Example housings and/or surfaces thereof may support one or more sensors (e.g., temperature sensors, vibration sensors, light sensors, acoustic sensors, capacitive sensors, proximity sensors, etc.). One or more such sensors may be contained in, carried by, or otherwise embedded in the surface and/or mounted to the surface of the appliance. Example housings and/or surfaces thereof may support mechanical connectivity, such as propulsion hardware (e.g., wheels, propellers, etc.) and/or articulating hardware (e.g., robot arms, pivotable appendages, etc.). 
In some circumstances, the sensors may include any type of input devices such as user interface hardware (e.g., buttons, switches, dials, sliders, etc.). In some circumstances, example housings include output devices contained in, carried by, embedded therein and/or attached thereto. Output devices may include displays, touchscreens, lights, LEDs, speakers, I/O ports (e.g., USB), etc. In some circumstances, Edge devices are devices presented in the network for a specific purpose (e.g., a traffic light), but may have processing and/or other capacities that may be utilized for other purposes. Such Edge devices may be independent from other networked devices and may be provided with a housing having a form factor suitable for its primary purpose; yet be available for other compute tasks that do not interfere with its primary task. Edge devices include Internet of Things devices. The appliance computing device may include hardware and software components to manage local issues such as device temperature, vibration, resource utilization, updates, power issues, physical and network security, etc. Example hardware for implementing an appliance computing device is described in conjunction with
In
In further examples, any of the compute nodes or devices discussed with reference to the present Edge computing systems and environment may be fulfilled based on the components depicted in
In the simplified example depicted in
The compute node D100 may be embodied as any type of engine, device, or collection of devices capable of performing various compute functions. In some examples, the compute node D100 may be embodied as a single device such as an integrated circuit, an embedded system, a field-programmable gate array (FPGA), a system-on-a-chip (SOC), or other integrated system or device. In the illustrative example, the compute node D100 includes or is embodied as a processor D104 and a memory D106. The processor D104 may be embodied as any type of processor capable of performing the functions described herein (e.g., executing an application). For example, the processor D104 may be embodied as a multi-core processor(s), a microcontroller, a processing unit, a specialized or special purpose processing unit, or other processor or processing/controlling circuit.
In some examples, the processor D104 may be embodied as, include, or be coupled to an FPGA, an application specific integrated circuit (ASIC), reconfigurable hardware or hardware circuitry, or other specialized hardware to facilitate performance of the functions described herein. Also in some examples, the processor D104 may be embodied as a specialized x-processing unit (xPU), also known as a data processing unit (DPU), infrastructure processing unit (IPU), or network processing unit (NPU). Such an xPU may be embodied as a standalone circuit or circuit package, integrated within an SOC, or integrated with networking circuitry (e.g., in a SmartNIC, or enhanced SmartNIC), acceleration circuitry, storage devices, storage disks, or AI hardware (e.g., GPUs or programmed FPGAs). Such an xPU may be designed to receive programming to process one or more data streams and perform specific tasks and actions for the data streams (such as hosting microservices, performing service management or orchestration, organizing or managing server or data center hardware, managing service meshes, or collecting and distributing telemetry), outside of the CPU or general purpose processing hardware. However, it will be understood that an xPU, an SOC, a CPU, and other variations of the processor D104 may work in coordination with each other to execute many types of operations and instructions within and on behalf of the compute node D100.
The memory D106 may be embodied as any type of volatile (e.g., dynamic random access memory (DRAM), etc.) or non-volatile memory or data storage capable of performing the functions described herein. Volatile memory may be a storage medium that requires power to maintain the state of data stored by the medium. Non-limiting examples of volatile memory may include various types of random access memory (RAM), such as DRAM or static random access memory (SRAM). One particular type of DRAM that may be used in a memory module is synchronous dynamic random access memory (SDRAM).
In an example, the memory device is a block addressable memory device, such as those based on NAND or NOR technologies (for example, Single-Level Cell (“SLC”), Multi-Level Cell (“MLC”), Quad-Level Cell (“QLC”), Tri-Level Cell (“TLC”), or some other NAND). In some examples, the memory device includes a byte-addressable write-in-place three dimensional crosspoint memory device, or other byte addressable write-in-place non-volatile memory (NVM) devices, such as single or multi-level Phase Change Memory (PCM) or phase change memory with a switch (PCMS), NVM devices that use chalcogenide phase change material (for example, chalcogenide glass), resistive memory including metal oxide base, oxygen vacancy base and Conductive Bridge Random Access Memory (CB-RAM), nanowire memory, ferroelectric transistor random access memory (FeTRAM), magneto resistive random access memory (MRAM) that incorporates memristor technology, spin transfer torque (STT)-MRAM, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, a combination of any of the above, or other suitable memory. A memory device may also include a three dimensional crosspoint memory device (e.g., Intel® 3D XPoint™ memory), or other byte addressable write-in-place nonvolatile memory devices. The memory device may refer to the die itself and/or to a packaged memory product. In some examples, 3D crosspoint memory (e.g., Intel® 3D XPoint™ memory) may comprise a transistor-less stackable cross point architecture in which memory cells sit at the intersection of word lines and bit lines and are individually addressable and in which bit storage is based on a change in bulk resistance. In some examples, all or a portion of the memory D106 may be integrated into the processor D104. 
The memory D106 may store various software and data used during operation such as one or more applications, data operated on by the application(s), libraries, and drivers.
In some examples, resistor-based and/or transistor-less memory architectures include nanometer scale phase-change memory (PCM) devices in which a volume of phase-change material resides between at least two electrodes. Portions of the example phase-change material exhibit varying degrees of crystalline phases and amorphous phases, in which varying degrees of resistance between the at least two electrodes can be measured. In some examples, the phase-change material is a chalcogenide-based glass material. Such resistive memory devices are sometimes referred to as memristive devices that remember the history of the current that previously flowed through them. Stored data is retrieved from example PCM devices by measuring the electrical resistance, in which the crystalline phases exhibit a relatively lower resistance value(s) (e.g., logical “0”) when compared to the amorphous phases having a relatively higher resistance value(s) (e.g., logical “1”).
Example PCM devices store data for long periods of time (e.g., approximately 10 years at room temperature). Write operations to example PCM devices (e.g., set to logical “0”, set to logical “1”, set to an intermediary resistance value) are accomplished by applying one or more current pulses to the at least two electrodes, in which the pulses have a particular current magnitude and duration. For instance, a long low current pulse (SET) applied to the at least two electrodes causes the example PCM device to reside in a low-resistance crystalline state, while a comparatively short high current pulse (RESET) applied to the at least two electrodes causes the example PCM device to reside in a high-resistance amorphous state.
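By way of illustration only, the read and write behavior described above can be sketched as a toy software model of a single PCM cell. The resistance values, the read threshold, and the names `PcmCell`, `apply_set_pulse`, `apply_reset_pulse`, and `read_bit` are hypothetical and are not taken from any particular device; intermediary resistance values are omitted for brevity.

```python
from dataclasses import dataclass

# Assumed, illustrative resistance values; real devices differ.
LOW_RESISTANCE_OHMS = 1e4    # crystalline state
HIGH_RESISTANCE_OHMS = 1e6   # amorphous state
READ_THRESHOLD_OHMS = 1e5    # assumed sensing threshold between the two states


@dataclass
class PcmCell:
    """Toy model of one phase-change memory cell (hypothetical)."""

    resistance_ohms: float = HIGH_RESISTANCE_OHMS

    def apply_set_pulse(self) -> None:
        # Long, low-current pulse: cell settles in the low-resistance crystalline state.
        self.resistance_ohms = LOW_RESISTANCE_OHMS

    def apply_reset_pulse(self) -> None:
        # Short, high-current pulse: cell settles in the high-resistance amorphous state.
        self.resistance_ohms = HIGH_RESISTANCE_OHMS

    def read_bit(self) -> int:
        # Read by measuring resistance, using the example convention in the text:
        # lower resistance -> logical 0, higher resistance -> logical 1.
        return 0 if self.resistance_ohms < READ_THRESHOLD_OHMS else 1


cell = PcmCell()
cell.apply_set_pulse()
bit_after_set = cell.read_bit()
cell.apply_reset_pulse()
bit_after_reset = cell.read_bit()
```

The model captures only the mapping between pulse type, resistance state, and stored value; pulse magnitude, pulse duration, and retention time are not modeled.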
In some examples, implementation of PCM devices facilitates non-von Neumann computing architectures that enable in-memory computing capabilities. Generally speaking, traditional computing architectures include a central processing unit (CPU) communicatively connected to one or more memory devices via a bus. As such, a finite amount of energy and time is consumed to transfer data between the CPU and memory, which is a known bottleneck of von Neumann computing architectures. However, PCM devices minimize and, in some cases, eliminate data transfers between the CPU and memory by performing some computing operations in-memory. Stated differently, PCM devices both store information and execute computational tasks. Such non-von Neumann computing architectures may implement vectors having a relatively high dimensionality to facilitate hyperdimensional computing, such as vectors having 10,000 bits. Relatively large bit width vectors enable computing paradigms modeled after the human brain, which also processes information analogous to wide bit vectors.
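As a non-limiting sketch of hyperdimensional computing with 10,000-bit vectors, the following example represents each vector as a Python integer. The use of XOR for binding and Hamming distance for comparison is an assumption drawn from common practice rather than from this disclosure, and the function names are hypothetical. The sketch runs on a conventional processor and does not model in-memory execution on PCM devices.

```python
import random

DIMENSIONS = 10_000  # vector width given as an example in the text


def random_hypervector(rng: random.Random) -> int:
    # A random 10,000-bit vector stored as an arbitrary-precision integer.
    return rng.getrandbits(DIMENSIONS)


def bind(a: int, b: int) -> int:
    # XOR binding (assumed operation); applying the same key again undoes it.
    return a ^ b


def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions in which the two vectors differ.
    return bin(a ^ b).count("1")


rng = random.Random(0)
role = random_hypervector(rng)
filler = random_hypervector(rng)
bound = bind(role, filler)
recovered = bind(bound, role)  # unbinding with the role recovers the filler
```

Two independently drawn vectors of this width differ in roughly half of their bits, which is why unrelated vectors remain easy to tell apart even when some bits are noisy.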
The compute circuitry D102 is communicatively coupled to other components of the compute node D100 via the I/O subsystem D108, which may be embodied as circuitry and/or components to facilitate input/output operations with the compute circuitry D102 (e.g., with the processor D104 and/or the main memory D106) and other components of the compute circuitry D102. For example, the I/O subsystem D108 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, integrated sensor hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.), and/or other components and subsystems to facilitate the input/output operations. In some examples, the I/O subsystem D108 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with one or more of the processor D104, the memory D106, and other components of the compute circuitry D102, into the compute circuitry D102.
The one or more illustrative data storage devices/disks D110 may be embodied as one or more of any type(s) of physical device(s) configured for short-term or long-term storage of data such as, for example, memory devices, memory, circuitry, memory cards, flash memory, hard disk drives, solid-state drives (SSDs), and/or other data storage devices/disks. Individual data storage devices/disks D110 may include a system partition that stores data and firmware code for the data storage device/disk D110. Individual data storage devices/disks D110 may also include one or more operating system partitions that store data files and executables for operating systems depending on, for example, the type of compute node D100.
The communication circuitry D112 may be embodied as any communication circuit, device, or collection thereof, capable of enabling communications over a network between the compute circuitry D102 and another compute device (e.g., an Edge gateway of an implementing Edge computing system). The communication circuitry D112 may be configured to use any one or more communication technologies (e.g., wired or wireless communications) and associated protocols (e.g., a cellular networking protocol such as a 3GPP 4G or 5G standard, a wireless local area network protocol such as IEEE 802.11/Wi-Fi®, a wireless wide area network protocol, Ethernet, Bluetooth®, Bluetooth Low Energy, an IoT protocol such as IEEE 802.15.4 or ZigBee®, low-power wide-area network (LPWAN) or low-power wide-area (LPWA) protocols, etc.) to effect such communication.
The illustrative communication circuitry D112 includes a network interface controller (NIC) D120, which may also be referred to as a host fabric interface (HFI). The NIC D120 may be embodied as one or more add-in-boards, daughter cards, network interface cards, controller chips, chipsets, or other devices that may be used by the compute node D100 to connect with another compute device (e.g., an Edge gateway node). In some examples, the NIC D120 may be embodied as part of a system-on-a-chip (SoC) that includes one or more processors, or included on a multichip package that also contains one or more processors. In some examples, the NIC D120 may include a local processor (not shown) and/or a local memory (not shown) that are both local to the NIC D120. In such examples, the local processor of the NIC D120 may be capable of performing one or more of the functions of the compute circuitry D102 described herein. Additionally, or alternatively, in such examples, the local memory of the NIC D120 may be integrated into one or more components of the client compute node at the board level, socket level, chip level, and/or other levels.
Additionally, in some examples, a respective compute node D100 may include one or more peripheral devices D114. Such peripheral devices D114 may include any type of peripheral device found in a compute device or server such as audio input devices, a display, other input/output devices, interface devices, and/or other peripheral devices, depending on the particular type of the compute node D100. In further examples, the compute node D100 may be embodied by a respective Edge compute node (whether a client, gateway, or aggregation node) in an Edge computing system or like forms of appliances, computers, subsystems, circuitry, or other components.
In a more detailed example,
The Edge computing device D150 may include processing circuitry in the form of a processor D152, which may be a microprocessor, a multi-core processor, a multithreaded processor, an ultra-low voltage processor, an embedded processor, an xPU/DPU/IPU/NPU, special purpose processing unit, specialized processing unit, or other known processing elements. The processor D152 may be a part of a system on a chip (SoC) in which the processor D152 and other components are formed into a single integrated circuit, or a single package, such as the Edison™ or Galileo™ SoC boards from Intel Corporation, Santa Clara, California. As an example, the processor D152 may include an Intel® Architecture Core™ based CPU processor, such as a Quark™, an Atom™, an i3, an i5, an i7, an i9, or an MCU-class processor, or another such processor available from Intel®. However, any number of other processors may be used, such as those available from Advanced Micro Devices, Inc. (AMD®) of Sunnyvale, California, a MIPS®-based design from MIPS Technologies, Inc. of Sunnyvale, California, an ARM®-based design licensed from ARM Holdings, Ltd. or a customer thereof, or their licensees or adopters. The processors may include units such as an A5-A13 processor from Apple® Inc., a Snapdragon™ processor from Qualcomm® Technologies, Inc., or an OMAP™ processor from Texas Instruments, Inc. The processor D152 and accompanying circuitry may be provided in a single socket form factor, multiple socket form factor, or a variety of other formats, including in limited hardware configurations or configurations that include fewer than all elements shown in
The processor D152 may communicate with a system memory D154 over an interconnect D156 (e.g., a bus). Any number of memory devices may be used to provide for a given amount of system memory. As examples, the memory D154 may be random access memory (RAM) in accordance with a Joint Electron Devices Engineering Council (JEDEC) design such as the DDR or mobile DDR standards (e.g., LPDDR, LPDDR2, LPDDR3, or LPDDR4). In particular examples, a memory component may comply with a DRAM standard promulgated by JEDEC, such as JESD79F for DDR SDRAM, JESD79-2F for DDR2 SDRAM, JESD79-3F for DDR3 SDRAM, JESD79-4A for DDR4 SDRAM, JESD209 for Low Power DDR (LPDDR), JESD209-2 for LPDDR2, JESD209-3 for LPDDR3, and JESD209-4 for LPDDR4. Such standards (and similar standards) may be referred to as DDR-based standards and communication interfaces of the storage devices that implement such standards may be referred to as DDR-based interfaces. In various implementations, the individual memory devices may be of any number of different package types such as single die package (SDP), dual die package (DDP) or quad die package (QDP). These devices, in some examples, may be directly soldered onto a motherboard to provide a lower profile solution, while in other examples the devices are configured as one or more memory modules that in turn couple to the motherboard by a given connector. Any number of other memory implementations may be used, such as other types of memory modules, e.g., dual inline memory modules (DIMMs) of different varieties including but not limited to microDIMMs or MiniDIMMs.
To provide for persistent storage of information such as data, applications, operating systems and so forth, a storage D158 may also couple to the processor D152 via the interconnect D156. In an example, the storage D158 may be implemented via a solid-state disk drive (SSDD). Other devices that may be used for the storage D158 include flash memory cards, such as Secure Digital (SD) cards, microSD cards, eXtreme Digital (XD) picture cards, and the like, and Universal Serial Bus (USB) flash drives. In an example, the memory device may be or may include memory devices that use chalcogenide glass, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level Phase Change Memory (PCM), a resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), anti-ferroelectric memory, magnetoresistive random access memory (MRAM) memory that incorporates memristor technology, resistive memory including the metal oxide base, the oxygen vacancy base and the conductive bridge Random Access Memory (CB-RAM), or spin transfer torque (STT)-MRAM, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, or a combination of any of the above, or other memory.
In low power implementations, the storage D158 may be on-die memory or registers associated with the processor D152. However, in some examples, the storage D158 may be implemented using a micro hard disk drive (HDD). Further, any number of new technologies may be used for the storage D158 in addition to, or instead of, the technologies described, such as resistance change memories, phase change memories, holographic memories, or chemical memories, among others.
The components may communicate over the interconnect D156. The interconnect D156 may include any number of technologies, including industry standard architecture (ISA), extended ISA (EISA), peripheral component interconnect (PCI), peripheral component interconnect extended (PCIx), PCI express (PCIe), or any number of other technologies. The interconnect D156 may be a proprietary bus, for example, used in an SoC based system. Other bus systems may be included, such as an Inter-Integrated Circuit (I2C) interface, a Serial Peripheral Interface (SPI) interface, point to point interfaces, and a power bus, among others.
The interconnect D156 may couple the processor D152 to a transceiver D166, for communications with the connected Edge devices D162. The transceiver D166 may use any number of frequencies and protocols, such as 2.4 Gigahertz (GHz) transmissions under the IEEE 802.15.4 standard, using the Bluetooth® low energy (BLE) standard, as defined by the Bluetooth® Special Interest Group, or the ZigBee® standard, among others. Any number of radios, configured for a particular wireless communication protocol, may be used for the connections to the connected Edge devices D162. For example, a wireless local area network (WLAN) unit may be used to implement Wi-Fi® communications in accordance with the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standard. In addition, wireless wide area communications, e.g., according to a cellular or other wireless wide area protocol, may occur via a wireless wide area network (WWAN) unit.
The wireless network transceiver D166 (or multiple transceivers) may communicate using multiple standards or radios for communications at a different range. For example, the Edge computing node D150 may communicate with close devices, e.g., within about 10 meters, using a local transceiver based on Bluetooth Low Energy (BLE), or another low power radio, to save power. More distant connected Edge devices D162, e.g., within about 50 meters, may be reached over ZigBee® or other intermediate power radios. Both communications techniques may take place over a single radio at different power levels or may take place over separate transceivers, for example, a local transceiver using BLE and a separate mesh transceiver using ZigBee®.
A wireless network transceiver D166 (e.g., a radio transceiver) may be included to communicate with devices or services in a cloud (e.g., an Edge cloud D195) via local or wide area network protocols. The wireless network transceiver D166 may be a low-power wide-area (LPWA) transceiver that follows the IEEE 802.15.4, or IEEE 802.15.4g standards, among others. The Edge computing node D150 may communicate over a wide area using LoRaWAN™ (Long Range Wide Area Network) developed by Semtech and the LoRa Alliance. The techniques described herein are not limited to these technologies but may be used with any number of other cloud transceivers that implement long range, low bandwidth communications, such as Sigfox, and other technologies. Further, other communications techniques, such as time-slotted channel hopping, described in the IEEE 802.15.4e specification may be used.
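The range-based choice of radio described in the two preceding paragraphs can be sketched, for illustration only, as a simple selection function. The 10 meter and 50 meter limits follow the approximate ranges given in the text; treating them as hard cut-offs, falling back to an LPWA radio beyond 50 meters, and the function name `select_radio` are assumptions made for the sketch.

```python
BLE_RANGE_M = 10.0      # "within about 10 meters" per the text
ZIGBEE_RANGE_M = 50.0   # "within about 50 meters" per the text


def select_radio(distance_m: float) -> str:
    """Pick the lowest-power radio assumed able to reach a device (hypothetical policy)."""
    if distance_m < 0:
        raise ValueError("distance_m must be non-negative")
    if distance_m <= BLE_RANGE_M:
        return "BLE"      # close devices: low power radio to save power
    if distance_m <= ZIGBEE_RANGE_M:
        return "ZigBee"   # more distant devices: intermediate power radio
    return "LPWA"         # assumed fallback: low-power wide-area transceiver


choices = [select_radio(d) for d in (5, 30, 200)]
```

An actual implementation would more likely select a radio from measured link quality than from distance alone; distance is used here only because it is the criterion named in the text.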
Any number of other radio communications and protocols may be used in addition to the systems mentioned for the wireless network transceiver D166, as described herein. For example, the transceiver D166 may include a cellular transceiver that uses spread spectrum (SPA/SAS) communications for implementing high-speed communications. Further, any number of other protocols may be used, such as Wi-Fi® networks for medium speed communications and provision of network communications. The transceiver D166 may include radios that are compatible with any number of 3GPP (Third Generation Partnership Project) specifications, such as Long Term Evolution (LTE) and 5th Generation (5G) communication systems, discussed in further detail at the end of the present disclosure. A network interface controller (NIC) D168 may be included to provide a wired communication to nodes of the Edge cloud D195 or to other devices, such as the connected Edge devices D162 (e.g., operating in a mesh). The wired communication may provide an Ethernet connection or may be based on other types of networks, such as Controller Area Network (CAN), Local Interconnect Network (LIN), DeviceNet, ControlNet, Data Highway+, PROFIBUS, or PROFINET, among many others. An additional NIC D168 may be included to enable connecting to a second network, for example, a first NIC D168 providing communications to the cloud over Ethernet, and a second NIC D168 providing communications to other devices over another type of network.
Given the variety of types of applicable communications from the device to another component or network, applicable communications circuitry used by the device may include or be embodied by any one or more of components D164, D166, D168, or D170. Accordingly, in various examples, applicable means for communicating (e.g., receiving, transmitting, etc.) may be embodied by such communications circuitry.
The Edge computing node D150 may include or be coupled to acceleration circuitry D164, which may be embodied by one or more artificial intelligence (AI) accelerators, a neural compute stick, neuromorphic hardware, an FPGA, an arrangement of GPUs, an arrangement of xPUs/DPUs/IPUs/NPUs, one or more SoCs, one or more CPUs, one or more digital signal processors, dedicated ASICs, or other forms of specialized processors or circuitry designed to accomplish one or more specialized tasks. These tasks may include AI processing (including machine learning, training, inferencing, and classification operations), visual data processing, network data processing, object detection, rule analysis, or the like. These tasks also may include the specific Edge computing tasks for service management and service operations discussed elsewhere in this document.
The interconnect D156 may couple the processor D152 to a sensor hub or external interface D170 that is used to connect additional devices or subsystems. The devices may include sensors D172, such as accelerometers, level sensors, flow sensors, optical light sensors, camera sensors, temperature sensors, global navigation system (e.g., GPS) sensors, pressure sensors, barometric pressure sensors, and the like. The hub or interface D170 further may be used to connect the Edge computing node D150 to actuators D174, such as power switches, valve actuators, an audible sound generator, a visual warning device, and the like.
In some optional examples, various input/output (I/O) devices may be present within or connected to, the Edge computing node D150. For example, a display or other output device D184 may be included to show information, such as sensor readings or actuator position. An input device D186, such as a touch screen or keypad may be included to accept input. An output device D184 may include any number of forms of audio or visual display, including simple visual outputs such as binary status indicators (e.g., light-emitting diodes (LEDs)) and multi-character visual outputs, or more complex outputs such as display screens (e.g., liquid crystal display (LCD) screens), with the output of characters, graphics, multimedia objects, and the like being generated or produced from the operation of the Edge computing node D150. A display or console hardware, in the context of the present system, may be used to provide output and receive input of an Edge computing system; to manage components or services of an Edge computing system; identify a state of an Edge computing component or service; or to conduct any other number of management or administration functions or service use cases.
A battery D176 may power the Edge computing node D150, although, in examples in which the Edge computing node D150 is mounted in a fixed location, it may have a power supply coupled to an electrical grid, or the battery may be used as a backup or for temporary capabilities. The battery D176 may be a lithium ion battery, or a metal-air battery, such as a zinc-air battery, an aluminum-air battery, a lithium-air battery, and the like.
A battery monitor/charger D178 may be included in the Edge computing node D150 to track the state of charge (SoCh) of the battery D176, if included. The battery monitor/charger D178 may be used to monitor other parameters of the battery D176 to provide failure predictions, such as the state of health (SoH) and the state of function (SoF) of the battery D176. The battery monitor/charger D178 may include a battery monitoring integrated circuit, such as an LTC4020 or an LTC2990 from Linear Technologies, an ADT7488A from ON Semiconductor of Phoenix, Arizona, or an IC from the UCD90xxx family from Texas Instruments of Dallas, TX. The battery monitor/charger D178 may communicate the information on the battery D176 to the processor D152 over the interconnect D156. The battery monitor/charger D178 may also include an analog-to-digital converter (ADC) that enables the processor D152 to directly monitor the voltage of the battery D176 or the current flow from the battery D176. The battery parameters may be used to determine actions that the Edge computing node D150 may perform, such as transmission frequency, mesh network operation, sensing frequency, and the like.
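As one hypothetical illustration of using battery parameters to determine node actions, the following sketch maps a reported state of charge to a transmission interval. The thresholds, the interval values, and the function name `select_transmission_interval_s` are assumptions chosen for the example and are not specified by this disclosure.

```python
def select_transmission_interval_s(state_of_charge: float,
                                   has_grid_power: bool = False) -> int:
    """Return seconds between transmissions for a given battery state (assumed policy).

    state_of_charge is a fraction from 0.0 (empty) to 1.0 (full).
    """
    if not 0.0 <= state_of_charge <= 1.0:
        raise ValueError("state_of_charge must be between 0.0 and 1.0")
    if has_grid_power or state_of_charge >= 0.6:
        return 10    # ample energy: transmit frequently
    if state_of_charge >= 0.2:
        return 60    # moderate charge: transmit less often
    return 600       # low charge: transmit rarely to conserve the battery


interval_low = select_transmission_interval_s(0.1)
interval_full = select_transmission_interval_s(0.9)
```

The same pattern could be applied to sensing frequency or to participation in mesh network operation, with state of health or state of function as additional inputs.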
A power block D180, or other power supply coupled to a grid, may be coupled with the battery monitor/charger D178 to charge the battery D176. In some examples, the power block D180 may be replaced with a wireless power receiver to obtain the power wirelessly, for example, through a loop antenna in the Edge computing node D150. A wireless battery charging circuit, such as an LTC4020 chip from Linear Technologies of Milpitas, California, among others, may be included in the battery monitor/charger D178. The specific charging circuits may be selected based on the size of the battery D176, and thus, the current required. The charging may be performed using the Airfuel standard promulgated by the Airfuel Alliance, the Qi wireless charging standard promulgated by the Wireless Power Consortium, or the Rezence charging standard, promulgated by the Alliance for Wireless Power, among others.
The storage D158 may include instructions D182 in the form of software, firmware, or hardware commands to implement the techniques described herein. Although such instructions D182 are shown as code blocks included in the memory D154 and the storage D158, it may be understood that any of the code blocks may be replaced with hardwired circuits, for example, built into an application specific integrated circuit (ASIC).
In an example, the instructions D182 provided via the memory D154, the storage D158, or the processor D152 may be embodied as a non-transitory, machine-readable medium D160 including code to direct the processor D152 to perform electronic operations in the Edge computing node D150. The processor D152 may access the non-transitory, machine-readable medium D160 over the interconnect D156. For instance, the non-transitory, machine-readable medium D160 may be embodied by devices described for the storage D158 or may include specific storage units such as storage devices and/or storage disks that include optical disks (e.g., digital versatile disk (DVD), compact disk (CD), CD-ROM, Blu-ray disk), flash drives, floppy disks, hard drives (e.g., SSDs), or any number of other hardware devices in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or caching). The non-transitory, machine-readable medium D160 may include instructions to direct the processor D152 to perform a specific sequence or flow of actions, for example, as described with respect to the flowchart(s) and block diagram(s) of operations and functionality depicted above. As used herein, the terms “machine-readable medium” and “computer-readable medium” are interchangeable. As used herein, the term “non-transitory computer-readable medium” is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
Also in a specific example, the instructions D182 on the processor D152 (separately, or in combination with the instructions D182 of the machine readable medium D160) may configure execution or operation of a trusted execution environment (TEE) D190. In an example, the TEE D190 operates as a protected area accessible to the processor D152 for secure execution of instructions and secure access to data. Various implementations of the TEE D190, and an accompanying secure area in the processor D152 or the memory D154 may be provided, for instance, through use of Intel® Software Guard Extensions (SGX) or ARM® TrustZone® hardware security extensions, Intel® Management Engine (ME), or Intel® Converged Security Manageability Engine (CSME). Other aspects of security hardening, hardware roots-of-trust, and trusted or protected operations may be implemented in the device D150 through the TEE D190 and the processor D152.
The processor platform D200 of the illustrated example includes processor circuitry D212. The processor circuitry D212 of the illustrated example is hardware. For example, the processor circuitry D212 can be implemented by one or more integrated circuits, logic circuits, FPGAs, microprocessors, CPUs, GPUs, DSPs, and/or microcontrollers from any desired family or manufacturer. The processor circuitry D212 may be implemented by one or more semiconductor based (e.g., silicon based) devices.
The processor circuitry D212 of the illustrated example includes a local memory D213 (e.g., a cache, registers, etc.). The processor circuitry D212 of the illustrated example is in communication with a main memory including a volatile memory D214 and a non-volatile memory D216 by a bus D218. The volatile memory D214 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®), and/or any other type of RAM device. The non-volatile memory D216 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory D214, D216 of the illustrated example is controlled by a memory controller D217.
The processor platform D200 of the illustrated example also includes interface circuitry D220. The interface circuitry D220 may be implemented by hardware in accordance with any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a Bluetooth® interface, a near field communication (NFC) interface, a PCI interface, and/or a PCIe interface.
In the illustrated example, one or more input devices D222 are connected to the interface circuitry D220. The input device(s) D222 permit(s) a user to enter data and/or commands into the processor circuitry D212. The input device(s) D222 can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, an isopoint device, and/or a voice recognition system.
One or more output devices D224 are also connected to the interface circuitry D220 of the illustrated example. The output devices D224 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or speaker. The interface circuitry D220 of the illustrated example, thus, typically includes a graphics driver card, a graphics driver chip, and/or graphics processor circuitry such as a GPU.
The interface circuitry D220 of the illustrated example also includes a communication device such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface to facilitate exchange of data with external machines (e.g., computing devices of any kind) by a network D226. The communication can be by, for example, an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, an optical connection, etc.
The processor platform D200 of the illustrated example also includes one or more mass storage devices D228 to store software and/or data. Examples of such mass storage devices D228 include magnetic storage devices, optical storage devices, floppy disk drives, HDDs, CDs, Blu-ray disk drives, redundant array of independent disks (RAID) systems, solid state storage devices such as flash memory devices, and DVD drives.
The machine executable instructions D232, which may be implemented by the machine readable instructions of any of the flowcharts disclosed herein, may be stored in the mass storage device D228, in the volatile memory D214, in the non-volatile memory D216, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.
The cores D302 may communicate by an example bus D304. In some examples, the bus D304 may implement a communication bus to effectuate communication associated with one(s) of the cores D302. For example, the bus D304 may implement at least one of an Inter-Integrated Circuit (I2C) bus, a Serial Peripheral Interface (SPI) bus, a PCI bus, or a PCIe bus. Additionally or alternatively, the bus D304 may implement any other type of computing or electrical bus. The cores D302 may obtain data, instructions, and/or signals from one or more external devices by example interface circuitry D306. The cores D302 may output data, instructions, and/or signals to the one or more external devices by the interface circuitry D306. Although the cores D302 of this example include example local memory D320 (e.g., Level 1 (L1) cache that may be split into an L1 data cache and an L1 instruction cache), the microprocessor D300 also includes example shared memory D310 that may be shared by the cores (e.g., Level 2 (L2) cache) for high-speed access to data and/or instructions. Data and/or instructions may be transferred (e.g., shared) by writing to and/or reading from the shared memory D310. The local memory D320 of each of the cores D302 and the shared memory D310 may be part of a hierarchy of storage devices including multiple levels of cache memory and the main memory (e.g., the main memory D214, D216 of
Each core D302 may be referred to as a CPU, DSP, GPU, etc., or any other type of hardware circuitry. Each core D302 includes control unit circuitry D314, arithmetic and logic (AL) circuitry (sometimes referred to as an ALU) D316, a plurality of registers D318, the L1 cache D320, and an example bus D322. Other structures may be present. For example, each core D302 may include vector unit circuitry, single instruction multiple data (SIMD) unit circuitry, load/store unit (LSU) circuitry, branch/jump unit circuitry, floating-point unit (FPU) circuitry, etc. The control unit circuitry D314 includes semiconductor-based circuits structured to control (e.g., coordinate) data movement within the corresponding core D302. The AL circuitry D316 includes semiconductor-based circuits structured to perform one or more mathematic and/or logic operations on the data within the corresponding core D302. The AL circuitry D316 of some examples performs integer based operations. In other examples, the AL circuitry D316 also performs floating point operations. In yet other examples, the AL circuitry D316 may include first AL circuitry that performs integer based operations and second AL circuitry that performs floating point operations. In some examples, the AL circuitry D316 may be referred to as an Arithmetic Logic Unit (ALU). The registers D318 are semiconductor-based structures to store data and/or instructions such as results of one or more of the operations performed by the AL circuitry D316 of the corresponding core D302. For example, the registers D318 may include vector register(s), SIMD register(s), general purpose register(s), flag register(s), segment register(s), machine specific register(s), instruction pointer register(s), control register(s), debug register(s), memory management register(s), machine check register(s), etc. The registers D318 may be arranged in a bank as shown in
Each core D302 and/or, more generally, the microprocessor D300 may include additional and/or alternate structures to those shown and described above. For example, one or more clock circuits, one or more power supplies, one or more power gates, one or more cache home agents (CHAs), one or more converged/common mesh stops (CMSs), one or more shifters (e.g., barrel shifter(s)) and/or other circuitry may be present. The microprocessor D300 is a semiconductor device fabricated to include many transistors interconnected to implement the structures described above in one or more integrated circuits (ICs) contained in one or more packages. The processor circuitry may include and/or cooperate with one or more accelerators. In some examples, accelerators are implemented by logic circuitry to perform certain tasks more quickly and/or efficiently than can be done by a general purpose processor. Examples of accelerators include ASICs and FPGAs such as those discussed herein. A GPU or other programmable device can also be an accelerator. Accelerators may be on-board the processor circuitry, in the same chip package as the processor circuitry and/or in one or more separate packages from the processor circuitry.
More specifically, in contrast to the microprocessor D300 of
In the example of
The interconnections D410 of the illustrated example are conductive pathways, traces, vias, or the like that may include electrically controllable switches (e.g., transistors) whose state can be changed by programming (e.g., using an HDL instruction language) to activate or deactivate one or more connections between one or more of the logic gate circuitry D408 to program desired logic circuits.
The storage circuitry D412 of the illustrated example is structured to store result(s) of the one or more of the operations performed by corresponding logic gates. The storage circuitry D412 may be implemented by registers or the like. In the illustrated example, the storage circuitry D412 is distributed amongst the logic gate circuitry D408 to facilitate access and increase execution speed.
The example FPGA circuitry D400 of
Although
In some examples, the processor circuitry D212 of
Often, IoT devices are limited in memory, size, or functionality, allowing larger numbers to be deployed for a similar cost to smaller numbers of larger devices. However, an IoT device may be a smart phone, laptop, tablet, or PC, or other larger device. Further, an IoT device may be a virtual device, such as an application on a smart phone or other computing device. IoT devices may include IoT gateways, used to couple IoT devices to other IoT devices and to cloud applications, for data storage, process control, and the like.
Networks of IoT devices may include commercial and home automation devices, such as water distribution systems, electric power distribution systems, pipeline control systems, plant control systems, light switches, thermostats, locks, cameras, alarms, motion sensors, and the like. The IoT devices may be accessible through remote computers, servers, and other systems, for example, to control systems or access data.
The future growth of the Internet and like networks may involve very large numbers of IoT devices. Accordingly, in the context of the techniques discussed herein, a number of innovations for such future networking will address the need for all these layers to grow unhindered, to discover and make accessible connected resources, and to support the ability to hide and compartmentalize connected resources. Any number of network protocols and communications standards may be used, wherein each protocol and standard is designed to address specific objectives. Further, the protocols are part of the fabric supporting human accessible services that operate regardless of location, time or space. The innovations include service delivery and associated infrastructure, such as hardware and software; security enhancements; and the provision of services based on Quality of Service (QoS) terms specified in service level and service delivery agreements. As will be understood, the use of IoT devices and networks, such as those introduced in
The network topology may include any number of types of IoT networks, such as a mesh network provided with the network F156 using Bluetooth low energy (BLE) links F122. Other types of IoT networks that may be present include a wireless local area network (WLAN) network F158 used to communicate with IoT devices F104 through IEEE 802.11 (Wi-Fi®) links F128, a cellular network F160 used to communicate with IoT devices F104 through an LTE/LTE-A (4G) or 5G cellular network, and a low-power wide area (LPWA) network F162, for example, an LPWA network compatible with the LoRaWAN specification promulgated by the LoRa Alliance, or an IPv6 over Low Power Wide-Area Networks (LPWAN) network compatible with a specification promulgated by the Internet Engineering Task Force (IETF). Further, the respective IoT networks may communicate with an outside network provider (e.g., a tier 2 or tier 3 provider) using any number of communications links, such as an LTE cellular link, an LPWA link, or a link based on the IEEE 802.15.4 standard, such as Zigbee®. The respective IoT networks may also operate with use of a variety of network and internet application protocols such as Constrained Application Protocol (CoAP). The respective IoT networks may also be integrated with coordinator devices that provide a chain of links that forms a cluster tree of linked devices and networks.
Each of these IoT networks may provide opportunities for new technical features, such as those described herein. The improved technologies and networks may enable the exponential growth of devices and networks, including the use of IoT networks as “fog” devices or integrated into “Edge” computing systems. As the use of such improved technologies grows, the IoT networks may be developed for self-management, functional evolution, and collaboration, without needing direct human intervention. The improved technologies may even enable IoT networks to function without centralized controlled systems. Accordingly, the improved technologies described herein may be used to automate and enhance network management and operation functions far beyond current implementations.
In an example, communications between IoT devices F104, such as over the backbone links F102, may be protected by a decentralized system for authentication, authorization, and accounting (AAA). In a decentralized AAA system, distributed payment, credit, audit, authorization, and authentication systems may be implemented across interconnected heterogeneous network infrastructure. This allows systems and networks to move towards autonomous operations. In these types of autonomous operations, machines may even contract for human resources and negotiate partnerships with other machine networks. This may allow the achievement of mutual objectives and balanced service delivery against outlined, planned service level agreements as well as achieve solutions that provide metering, measurements, traceability, and trackability. The creation of new supply chain structures and methods may enable a multitude of services to be created, mined for value, and collapsed without any human involvement.
Such IoT networks may be further enhanced by the integration of sensing technologies, such as sound, light, electronic traffic, facial and pattern recognition, smell, and vibration, into the autonomous organizations among the IoT devices. The integration of sensory systems may allow systematic and autonomous communication and coordination of service delivery against contractual service objectives, orchestration and quality of service (QoS) based swarming and fusion of resources. Some of the individual examples of network-based resource processing include the following.
The mesh network F156, for instance, may be enhanced by systems that perform inline data-to-information transforms. For example, self-forming chains of processing resources comprising a multi-link network may distribute the transformation of raw data to information in an efficient manner, and may provide the ability to differentiate between assets and resources and to manage each accordingly. Furthermore, the proper components of infrastructure and resource based trust and service indices may be inserted to improve data integrity, quality, and assurance, and to deliver a metric of data confidence.
The WLAN network F158, for instance, may use systems that perform standards conversion to provide multi-standard connectivity, enabling IoT devices F104 using different protocols to communicate. Further systems may provide seamless interconnectivity across a multi-standard infrastructure comprising visible Internet resources and hidden Internet resources.
Communications in the cellular network F160, for instance, may be enhanced by systems that offload data, extend communications to more remote devices, or both. The LPWA network F162 may include systems that perform non-Internet protocol (IP) to IP interconnections, addressing, and routing. Further, each of the IoT devices F104 may include the appropriate transceiver for wide area communications with that device. Further, each IoT device F104 may include other transceivers for communications using additional protocols and frequencies. This is discussed further with respect to the communication environment and hardware of an IoT processing device depicted in
Finally, clusters of IoT devices may be equipped to communicate with other IoT devices as well as with a cloud network. This may allow the IoT devices to form an ad-hoc network between the devices, allowing them to function as a single device, which may be termed a fog device, fog platform, or fog network. This configuration is discussed further with respect to
The fog network F220 may be considered to be a massively interconnected network wherein a number of IoT devices F202 are in communications with each other, for example, by radio links F222. The fog network F220 may establish a horizontal, physical, or virtual resource platform that can be considered to reside between IoT Edge devices and cloud or data centers. A fog network, in some examples, may support vertically-isolated, latency-sensitive applications through layered, federated, or distributed computing, storage, and network connectivity operations. However, a fog network may also be used to distribute resources and services at and among the Edge and the cloud. Thus, references in the present document to the “Edge”, “fog”, and “cloud” are not necessarily discrete or exclusive of one another.
As an example, the fog network F220 may be facilitated using an interconnect specification released by the Open Connectivity Foundation™ (OCF). This standard allows devices to discover each other and establish communications for interconnects. Other interconnection protocols may also be used, including, for example, the optimized link state routing (OLSR) Protocol, the better approach to mobile ad-hoc networking (B.A.T.M.A.N.) routing protocol, or the OMA Lightweight M2M (LWM2M) protocol, among others.
Three types of IoT devices F202 are shown in this example: gateways F204, data aggregators F226, and sensors F228, although any combinations of IoT devices F202 and functionality may be used. The gateways F204 may be Edge devices that provide communications between the cloud F200 and the fog network F220, and may also provide the back end processing function for data obtained from sensors F228, such as motion data, flow data, temperature data, and the like. The data aggregators F226 may collect data from any number of the sensors F228, and perform the back end processing function for the analysis. The results, raw data, or both may be passed along to the cloud F200 through the gateways F204. The sensors F228 may be full IoT devices F202, for example, capable of both collecting data and processing the data. In some cases, the sensors F228 may be more limited in functionality, for example, collecting the data and allowing the data aggregators F226 or gateways F204 to process the data.
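As a non-limiting illustration, the aggregation role described above may be sketched as follows. The function name `aggregate`, the tuple layout of a reading, and the choice of summary statistics are assumptions made for this sketch only; the disclosure does not prescribe a particular data format or analysis.

```python
from statistics import mean


def aggregate(readings, include_raw=False):
    """Summarize sensor readings the way a data aggregator might.

    `readings` is a list of (sensor_id, kind, value) tuples; this layout
    is hypothetical and chosen only for illustration.
    """
    # Group the values by the kind of data (e.g., temperature, flow).
    by_kind = {}
    for _sensor_id, kind, value in readings:
        by_kind.setdefault(kind, []).append(value)

    # Compute a small summary per kind of data.
    results = {
        kind: {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for kind, values in by_kind.items()
    }

    # Pass along the results, and optionally the raw data as well.
    message = {"results": results}
    if include_raw:
        message["raw"] = list(readings)
    return message
```

In this sketch, the returned message stands in for what would be passed to the cloud through a gateway; setting `include_raw=True` corresponds to forwarding both the results and the raw data.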
Communications from any IoT device F202 may be passed along a convenient path between any of the IoT devices F202 to reach the gateways F204. In these networks, the number of interconnections provides substantial redundancy, allowing communications to be maintained, even with the loss of a number of IoT devices F202. Further, the use of a mesh network may allow IoT devices F202 that are very low power or located at a distance from infrastructure to be used, as the range to connect to another IoT device F202 may be much less than the range to connect to the gateways F204.
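The redundancy described above may be illustrated with a minimal sketch that searches a mesh for a path to any gateway while skipping devices that have been lost. The breadth-first search, the function name `path_to_gateway`, and the device names in `LINKS` are assumptions introduced for illustration; the disclosure does not specify a routing algorithm.

```python
from collections import deque


def path_to_gateway(links, source, gateways, failed=frozenset()):
    """Return a shortest hop path from `source` to any gateway, or None.

    `links` maps each device name to the devices it can reach directly.
    Devices listed in `failed` are treated as lost and are never used.
    """
    if source in failed:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in gateways:
            return path
        for neighbor in links.get(node, ()):
            if neighbor not in visited and neighbor not in failed:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical mesh: two independent routes lead from sensor_a to gateway_1.
LINKS = {
    "sensor_a": ["sensor_b", "sensor_c"],
    "sensor_b": ["sensor_a", "gateway_1"],
    "sensor_c": ["sensor_a", "sensor_d"],
    "sensor_d": ["sensor_c", "gateway_1"],
    "gateway_1": ["sensor_b", "sensor_d"],
}
```

With all devices operational, the search reaches the gateway through `sensor_b`. If `sensor_b` is passed in `failed`, the search instead finds the longer route through `sensor_c` and `sensor_d`, which is the kind of maintained communication the paragraph above describes.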
The fog network F220 provided from these IoT devices F202 may be presented to devices in the cloud F200, such as a server F206, as a single device located at the Edge of the cloud F200, e.g., a fog network operating as a device or platform. In this example, the alerts coming from the fog platform may be sent without being identified as coming from a specific IoT device F202 within the fog network F220. In this fashion, the fog network F220 may be considered a distributed platform that provides computing and storage resources to perform processing or data-intensive tasks such as data analytics, data aggregation, and machine-learning, among others.
In some examples, the IoT devices F202 may be configured using an imperative programming style, e.g., with each IoT device F202 having a specific function and communication partners. However, the IoT devices F202 forming the fog platform may be configured in a declarative programming style, enabling the IoT devices F202 to reconfigure their operations and communications, such as to determine needed resources in response to conditions, queries, and device failures. As an example, a query from a user located at a server F206 about the operations of a subset of equipment monitored by the IoT devices F202 may result in the fog network F220 selecting the IoT devices F202, such as particular sensors F228, needed to answer the query. The data from these sensors F228 may then be aggregated and analyzed by any combination of the sensors F228, data aggregators F226, or gateways F204, before being sent on by the fog network F220 to the server F206 to answer the query. In this example, IoT devices F202 in the fog network F220 may select the sensors F228 used based on the query, such as adding data from flow sensors or temperature sensors. Further, if some of the IoT devices F202 are not operational, other IoT devices F202 in the fog network F220 may provide analogous data, if available.
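One way to picture the declarative style described above is a query that names only the equipment and the kinds of data needed, leaving the choice of sensors to the fog network. The following is a minimal sketch under that reading; the `Sensor` record, the `select_sensors` function, and the identifiers used in the usage note are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    sensor_id: str
    kind: str          # kind of data produced, e.g., "flow" or "temperature"
    equipment: str     # equipment the sensor monitors
    operational: bool = True


def select_sensors(sensors, equipment, kinds):
    """Pick one operational sensor per requested kind of data.

    The query states what is needed (equipment and kinds of data), not
    which device supplies it. If the first matching sensor is not
    operational, another sensor of the same kind on the same equipment
    is used when one is available; otherwise that kind is omitted.
    """
    selected = {}
    for kind in kinds:
        candidates = [
            s for s in sensors
            if s.equipment == equipment and s.kind == kind and s.operational
        ]
        if candidates:
            selected[kind] = candidates[0].sensor_id
    return selected
```

For example, if a flow sensor `flow-1` on a monitored pump is not operational but a second flow sensor `flow-2` on the same pump is, a query for flow data would be answered from `flow-2`, corresponding to the case where other IoT devices provide analogous data.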
In other examples, the operations and functionality described herein may be embodied by an IoT or Edge compute device in the example form of an electronic processing system, within which a set or sequence of instructions may be executed to cause the electronic processing system to perform any one of the methodologies discussed herein, according to an example embodiment. The device may be an IoT device or an IoT gateway, including a machine embodied by aspects of a personal computer (PC), a tablet PC, a personal digital assistant (PDA), a mobile telephone or smartphone, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine.
Further, while only a single machine may be depicted and referenced in the examples above, such machine shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. Further, these and like examples to a processor-based system shall be taken to include any set of one or more machines that are controlled by or operated by a processor, set of processors, or processing circuitry (e.g., a computer) to individually or jointly execute instructions to perform any one or more of the methodologies discussed herein. Accordingly, in various examples, applicable means for processing (e.g., processing, controlling, generating, evaluating, etc.) may be embodied by such processing circuitry.
Other example groups of IoT devices may include remote weather stations F314, local information terminals F316, alarm systems F318, automated teller machines F320, alarm panels F322, or moving vehicles, such as emergency vehicles F324 or other vehicles F326, among many others. Each of these IoT devices may be in communication with other IoT devices, with servers F304, with another IoT fog device or system (not shown, but depicted in
As may be seen from
Clusters of IoT devices, such as the remote weather stations F314 or the traffic control group F306, may be equipped to communicate with other IoT devices as well as with the cloud F300. This may allow the IoT devices to form an ad-hoc network between the devices, allowing them to function as a single device, which may be termed a fog device or system (e.g., as described above with reference to
The IoT device F450 may include processing circuitry in the form of a processor F452, which may be a microprocessor, a multi-core processor, a multithreaded processor, an ultra-low voltage processor, an embedded processor, or other known processing elements. The processor F452 may be a part of a system on a chip (SoC) in which the processor F452 and other components are formed into a single integrated circuit, or a single package, such as the Edison™ or Galileo™ SoC boards from Intel. As an example, the processor F452 may include an Intel® Architecture Core™ based processor, such as a Quark™, an Atom™, an i3, an i5, an i7, or an MCU-class processor, or another such processor available from Intel® Corporation, Santa Clara, CA. However, any number of other processors may be used, such as those available from Advanced Micro Devices, Inc. (AMD) of Sunnyvale, CA, a MIPS-based design from MIPS Technologies, Inc. of Sunnyvale, CA, an ARM-based design licensed from ARM Holdings, Ltd. or a customer thereof, or their licensees or adopters. The processors may include units such as an A5-A14 processor from Apple® Inc., a Snapdragon™ processor from Qualcomm® Technologies, Inc., or an OMAP™ processor from Texas Instruments, Inc.
The processor F452 may communicate with a system memory F454 over an interconnect F456 (e.g., a bus). Any number of memory devices may be used to provide for a given amount of system memory. As examples, the memory may be random access memory (RAM) in accordance with a Joint Electron Devices Engineering Council (JEDEC) design such as the DDR or mobile DDR standards (e.g., LPDDR, LPDDR2, LPDDR3, or LPDDR4). In various implementations the individual memory devices may be of any number of different package types such as single die package (SDP), dual die package (DDP) or quad die package (QDP). These devices, in some examples, may be directly soldered onto a motherboard to provide a lower profile solution, while in other examples the devices are configured as one or more memory modules that in turn couple to the motherboard by a given connector. Any number of other memory implementations may be used, such as other types of memory modules, e.g., dual inline memory modules (DIMMs) of different varieties including but not limited to microDIMMs or MiniDIMMs.
To provide for persistent storage of information such as data, applications, operating systems and so forth, a storage F458 may also couple to the processor F452 via the interconnect F456. In an example the storage F458 may be implemented via a solid state disk drive (SSDD). Other devices that may be used for the storage F458 include flash memory cards, such as SD cards, microSD cards, xD picture cards, and the like, and USB flash drives. In low power implementations, the storage F458 may be on-die memory or registers associated with the processor F452. However, in some examples, the storage F458 may be implemented using a micro hard disk drive (HDD). Further, any number of new technologies may be used for the storage F458 in addition to, or instead of, the technologies described, such as resistance change memories, phase change memories, holographic memories, or chemical memories, among others.
The components may communicate over the interconnect F456. The interconnect F456 may include any number of technologies, including industry standard architecture (ISA), extended ISA (EISA), peripheral component interconnect (PCI), peripheral component interconnect extended (PCIx), PCI express (PCIe), or any number of other technologies. The interconnect F456 may be a proprietary bus, for example, used in a SoC based system. Other bus systems may be included, such as an I2C interface, an SPI interface, point to point interfaces, and a power bus, among others.
Given the variety of types of applicable communications from the device to another component or network, applicable communications circuitry used by the device may include or be embodied by any one or more of components F462, F466, F468, or F470. Accordingly, in various examples, applicable means for communicating (e.g., receiving, transmitting, etc.) may be embodied by such communications circuitry.
The interconnect F456 may couple the processor F452 to a mesh transceiver F462, for communications with other mesh devices F464. The mesh transceiver F462 may use any number of frequencies and protocols, such as 2.4 Gigahertz (GHz) transmissions under the IEEE 802.15.4 standard, using the Bluetooth® low energy (BLE) standard, as defined by the Bluetooth® Special Interest Group, or the ZigBee® standard, among others. Any number of radios, configured for a particular wireless communication protocol, may be used for the connections to the mesh devices F464. For example, a WLAN unit may be used to implement Wi-Fi™ communications in accordance with the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standard. In addition, wireless wide area communications, e.g., according to a cellular or other wireless wide area protocol, may occur via a WWAN unit.
The mesh transceiver F462 may communicate using multiple standards or radios for communications at different ranges. For example, the IoT device F450 may communicate with close devices, e.g., within about 10 meters, using a local transceiver based on BLE, or another low power radio, to save power. More distant mesh devices F464, e.g., within about 50 meters, may be reached over ZigBee or other intermediate power radios. Both communications techniques may take place over a single radio at different power levels, or may take place over separate transceivers, for example, a local transceiver using BLE and a separate mesh transceiver using ZigBee.
A wireless network transceiver F466 may be included to communicate with devices or services in the cloud F400 via local or wide area network protocols. The wireless network transceiver F466 may be an LPWA transceiver that follows the IEEE 802.15.4, or IEEE 802.15.4g standards, among others. The IoT device F450 may communicate over a wide area using LoRaWAN™ (Long Range Wide Area Network) developed by Semtech and the LoRa Alliance. The techniques described herein are not limited to these technologies, but may be used with any number of other cloud transceivers that implement long range, low bandwidth communications, such as Sigfox, and other technologies. Further, other communications techniques, such as time-slotted channel hopping, described in the IEEE 802.15.4e specification may be used.
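The range-based choice of radio described above may be sketched as a simple selection rule. The function name `choose_radio` and its parameters are assumptions for illustration, and the default thresholds are taken from the approximate 10 meter and 50 meter figures given in the example; an actual device may use different ranges or selection criteria (e.g., measured link quality).

```python
def choose_radio(distance_m, ble_range_m=10.0, zigbee_range_m=50.0):
    """Pick the lowest power radio expected to reach a mesh device.

    Returns "BLE" for close devices, "ZigBee" for more distant devices,
    and None when the device is beyond both illustrative ranges.
    """
    if distance_m < 0:
        raise ValueError("distance_m must be non-negative")
    if distance_m <= ble_range_m:
        return "BLE"       # close device: use the low power local radio
    if distance_m <= zigbee_range_m:
        return "ZigBee"    # more distant device: intermediate power radio
    return None            # out of range for the mesh transceiver
```

Under these assumed thresholds, a device about 3 meters away would be reached over BLE to save power, while a device about 35 meters away would be reached over ZigBee.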
Any number of other radio communications and protocols may be used in addition to the systems mentioned for the mesh transceiver F462 and wireless network transceiver F466, as described herein. For example, the radio transceivers F462 and F466 may include an LTE or other cellular transceiver that uses spread spectrum (SPA/SAS) communications for implementing high speed communications. Further, any number of other protocols may be used, such as Wi-Fi® networks for medium speed communications and provision of network communications.
The radio transceivers F462 and F466 may include radios that are compatible with any number of 3GPP (Third Generation Partnership Project) specifications, notably Long Term Evolution (LTE), Long Term Evolution-Advanced (LTE-A), and Long Term Evolution-Advanced Pro (LTE-A Pro). It may be noted that radios compatible with any number of other fixed, mobile, or satellite communication technologies and standards may be selected. These may include, for example, any Cellular Wide Area radio communication technology, which may include e.g. a 5th Generation (5G) communication systems, a Global System for Mobile Communications (GSM) radio communication technology, a General Packet Radio Service (GPRS) radio communication technology, or an Enhanced Data Rates for GSM Evolution (EDGE) radio communication technology, a UMTS (Universal Mobile Telecommunications System) communication technology, In addition to the standards listed above, any number of satellite uplink technologies may be used for the wireless network transceiver F466, including, for example, radios compliant with standards issued by the ITU (International Telecommunication Union), or the ETSI (European Telecommunications Standards Institute), among others. The examples provided herein are thus understood as being applicable to various other communication technologies, both existing and not yet formulated.
A network interface controller (NIC) F468 may be included to provide a wired communication to the cloud F400 or to other devices, such as the mesh devices F464. The wired communication may provide an Ethernet connection, or may be based on other types of networks, such as Controller Area Network (CAN), Local Interconnect Network (LIN), DeviceNet, ControlNet, Data Highway+, PROFIBUS, or PROFINET, among many others. An additional NIC F468 may be included to enable connection to a second network, for example, a NIC F468 providing communications to the cloud over Ethernet, and a second NIC F468 providing communications to other devices over another type of network.
The interconnect F456 may couple the processor F452 to an external interface F470 that is used to connect external devices or subsystems. The external devices may include sensors F472, such as accelerometers, level sensors, flow sensors, optical light sensors, camera sensors, temperature sensors, global positioning system (GPS) sensors, pressure sensors, barometric pressure sensors, and the like. The external interface F470 further may be used to connect the IoT device F450 to actuators F474, such as power switches, valve actuators, an audible sound generator, a visual warning device, and the like.
In some optional examples, various input/output (I/O) devices may be present within, or connected to, the IoT device F450. For example, a display or other output device F484 may be included to show information, such as sensor readings or actuator position. An input device F486, such as a touch screen or keypad, may be included to accept input. The output device F484 may include any number of forms of audio or visual display, including simple visual outputs such as binary status indicators (e.g., LEDs) and multi-character visual outputs, or more complex outputs such as display screens (e.g., LCD screens), with the output of characters, graphics, multimedia objects, and the like being generated or produced from the operation of the IoT device F450.
A battery F476 may power the IoT device F450, although in examples in which the IoT device F450 is mounted in a fixed location, it may have a power supply coupled to an electrical grid. The battery F476 may be a lithium ion battery, or a metal-air battery, such as a zinc-air battery, an aluminum-air battery, a lithium-air battery, and the like.
A battery monitor/charger F478 may be included in the IoT device F450 to track the state of charge (SoCh) of the battery F476. The battery monitor/charger F478 may be used to monitor other parameters of the battery F476 to provide failure predictions, such as the state of health (SoH) and the state of function (SoF) of the battery F476. The battery monitor/charger F478 may include a battery monitoring integrated circuit, such as an LTC4020 or an LTC2990 from Linear Technologies, an ADT7488A from ON Semiconductor of Phoenix, Arizona, or an IC from the UCD90xxx family from Texas Instruments of Dallas, TX. The battery monitor/charger F478 may communicate the information on the battery F476 to the processor F452 over the interconnect F456. The battery monitor/charger F478 may also include an analog-to-digital converter (ADC) that allows the processor F452 to directly monitor the voltage of the battery F476 or the current flow from the battery F476. The battery parameters may be used to determine actions that the IoT device F450 may perform, such as transmission frequency, mesh network operation, sensing frequency, and the like.
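As a minimal sketch of how battery parameters might drive a device action such as transmission frequency, the following Python example maps a state of charge and a state of health to a transmission interval. The class name, function name, thresholds, and multipliers are hypothetical and chosen only for illustration; they are not taken from this disclosure.

```python
from dataclasses import dataclass


@dataclass
class BatteryStatus:
    state_of_charge: float  # fraction of full charge, 0.0 to 1.0
    state_of_health: float  # fraction of original capacity, 0.0 to 1.0


def transmission_interval_s(status: BatteryStatus, base_interval_s: float = 60.0) -> float:
    """Return the seconds to wait between transmissions.

    The thresholds and multipliers below are illustrative assumptions only.
    """
    for value in (status.state_of_charge, status.state_of_health):
        if not 0.0 <= value <= 1.0:
            raise ValueError("battery fractions must lie in [0.0, 1.0]")
    if status.state_of_charge < 0.2 or status.state_of_health < 0.5:
        return base_interval_s * 10  # weak battery: transmit rarely
    if status.state_of_charge < 0.5:
        return base_interval_s * 3   # moderate back-off
    return base_interval_s           # healthy battery: nominal rate
```

The same pattern could be applied to sensing frequency or mesh network participation by substituting a different returned quantity.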
A power block F480, or other power supply coupled to a grid, may be coupled with the battery monitor/charger F478 to charge the battery F476. In some examples, the power block F480 may be replaced with a wireless power receiver to obtain the power wirelessly, for example, through a loop antenna in the IoT device F450. A wireless battery charging circuit, such as an LTC4020 chip from Linear Technologies of Milpitas, CA, among others, may be included in the battery monitor/charger F478. The specific charging circuits chosen depend on the size of the battery F476, and thus, the current required. The charging may be performed using the Airfuel standard promulgated by the Airfuel Alliance, the Qi wireless charging standard promulgated by the Wireless Power Consortium, or the Rezence charging standard, promulgated by the Alliance for Wireless Power, among others.
The storage F458 may include instructions F482 in the form of software, firmware, or hardware commands to implement the techniques described herein. Although such instructions F482 are shown as code blocks included in the memory F454 and the storage F458, it may be understood that any of the code blocks may be replaced with hardwired circuits, for example, built into an application specific integrated circuit (ASIC).
In an example, the instructions F482 provided via the memory F454, the storage F458, or the processor F452 may be embodied as a non-transitory, machine readable medium F460 including code to direct the processor F452 to perform electronic operations in the IoT device F450. The processor F452 may access the non-transitory, machine readable medium F460 over the interconnect F456. For instance, the non-transitory, machine readable medium F460 may be embodied by devices described for the storage F458 of
Also in a specific example, the instructions F488 on the processor F452 (separately, or in combination with the instructions F488 of the machine readable medium F460) may configure execution or operation of a trusted execution environment (TEE) F490. In an example, the TEE F490 operates as a protected area accessible to the processor F452 for secure execution of instructions and secure access to data. Various implementations of the TEE F490, and an accompanying secure area in the processor F452 or the memory F454 may be provided, for instance, through use of Intel® Software Guard Extensions (SGX) or ARM® TrustZone® hardware security extensions, Intel® Management Engine (ME), or Intel® Converged Security Manageability Engine (CSME). Other aspects of security hardening, hardware roots-of-trust, and trusted or protected operations may be implemented in the device F450 through the TEE F490 and the processor F452.
At a more generic level, an Edge computing system may be described to encompass any number of deployments operating in an Edge cloud A110, which provide coordination from client and distributed computing devices.
Each node or device of the Edge computing system is located at a particular layer corresponding to layers F510, F520, F530, F540, F550. For example, the client compute nodes F502 are each located at an endpoint layer F510, while each of the Edge gateway nodes F512 is located at an Edge devices layer F520 (local level) of the Edge computing system. Additionally, each of the Edge aggregation nodes F522 (and/or fog devices F524, if arranged or operated with or among a fog networking configuration F526) is located at a network access layer F530 (an intermediate level). Fog computing (or “fogging”) generally refers to extensions of cloud computing to the Edge of an enterprise's network, typically in a coordinated distributed or multi-node network. Some forms of fog computing provide the deployment of compute, storage, and networking services between end devices and cloud computing data centers, on behalf of the cloud computing locations. Such forms of fog computing provide operations that are consistent with Edge computing as discussed herein; many of the Edge computing aspects discussed herein are applicable to fog networks, fogging, and fog configurations. Further, aspects of the Edge computing systems discussed herein may be configured as a fog, or aspects of a fog may be integrated into an Edge computing architecture.
The core data center F532 is located at a core network layer F540 (e.g., a regional or geographically-central level), while the global network cloud F542 is located at a cloud data center layer F550 (e.g., a national or global layer). The use of “core” is provided as a term for a centralized network location, deeper in the network, which is accessible by multiple Edge nodes or components; however, a “core” does not necessarily designate the “center” or the deepest location of the network. Accordingly, the core data center F532 may be located within, at, or near the Edge cloud A110.
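The layer assignments described above can be summarized as a small lookup table. The following Python sketch is illustrative only: the enumeration, the dictionary keys, and the helper function are hypothetical names, while the layer reference numerals are those used in the description.

```python
from enum import Enum


class Layer(Enum):
    # Declared from the endpoint toward the cloud; values are the reference numerals.
    ENDPOINT = "F510"
    EDGE_DEVICES = "F520"
    NETWORK_ACCESS = "F530"
    CORE_NETWORK = "F540"
    CLOUD_DATA_CENTER = "F550"


# Hypothetical keys naming each kind of node in the description.
NODE_LAYER = {
    "client_compute_node": Layer.ENDPOINT,
    "edge_gateway_node": Layer.EDGE_DEVICES,
    "edge_aggregation_node": Layer.NETWORK_ACCESS,
    "fog_device": Layer.NETWORK_ACCESS,
    "core_data_center": Layer.CORE_NETWORK,
    "global_network_cloud": Layer.CLOUD_DATA_CENTER,
}


def layers_below_cloud(node_kind: str) -> int:
    """Count how many layers separate a node kind from the cloud data center layer."""
    order = list(Layer)  # Enum iteration follows declaration order
    return len(order) - 1 - order.index(NODE_LAYER[node_kind])
```

Because a core data center may also sit within, at, or near the Edge cloud, a table like this captures only the nominal placement, not every permitted deployment.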
Although an illustrative number of client compute nodes F502, Edge gateway nodes F512, Edge aggregation nodes F522, core data centers F532, global network clouds F542 are shown in
Consistent with the examples provided herein, each client compute node F502 may be embodied as any type of end point component, device, appliance, or “thing” capable of communicating as a producer or consumer of data. Further, the label “node” or “device” as used in the Edge computing system F500 does not necessarily mean that such node or device operates in a client or agent/minion/follower role; rather, any of the nodes or devices in the Edge computing system F500 refer to individual entities, nodes, or subsystems which include discrete or connected hardware or software configurations to facilitate or use the Edge cloud A110.
As such, the Edge cloud A110 is formed from network components and functional features operated by and within the Edge gateway nodes F512 and the Edge aggregation nodes F522 of layers F520, F530, respectively. The Edge cloud A110 may be embodied as any type of network that provides Edge computing and/or storage resources which are proximately located to radio access network (RAN) capable endpoint devices (e.g., mobile computing devices, IoT devices, smart devices, etc.), which are shown in
In some examples, the Edge cloud A110 may form a portion of or otherwise provide an ingress point into or across a fog networking configuration F526 (e.g., a network of fog devices F524, not shown in detail), which may be embodied as a system-level horizontal and distributed architecture that distributes resources and services to perform a specific function. For instance, a coordinated and distributed network of fog devices F524 may perform computing, storage, control, or networking aspects in the context of an IoT system arrangement. Other networked, aggregated, and distributed functions may exist in the Edge cloud A110 between the cloud data center layer F550 and the client endpoints (e.g., client compute nodes F502). Some of these are discussed in the following sections in the context of network functions or service virtualization, including the use of virtual Edges and virtual services which are orchestrated for multiple stakeholders.
The Edge gateway nodes F512 and the Edge aggregation nodes F522 cooperate to provide various Edge services and security to the client compute nodes F502. Furthermore, because each client compute node F502 may be stationary or mobile, each Edge gateway node F512 may cooperate with other Edge gateway devices to propagate presently provided Edge services and security as the corresponding client compute node F502 moves about a region. To do so, each of the Edge gateway nodes F512 and/or Edge aggregation nodes F522 may support multiple tenancy and multiple stakeholder configurations, in which services from (or hosted for) multiple service providers and multiple consumers may be supported and coordinated across a single or multiple compute devices.
A block diagram illustrating an example software distribution platform 1105 to distribute software such as the example machine readable instructions D232 of
Entities that service workload requests are chartered with the responsibility of distributing those workloads in a manner that satisfies client demands. In some environments, the underlying platform resources are known ahead of time so that a workload for a target computational resource is optimized. However, Edge computing resources are being utilized to a greater extent and the target computational devices are heterogeneous. Target computational devices may include CPUs, GPUs, FPGAs and/or other types of accelerators.
Current workload optimization is not handled in a scalable manner when the workload operates on a first computational device (e.g., a CPU) at a first time and on a different, second computational device (e.g., a GPU) at a second time. Current workload optimization also fails to consider service level agreements (SLAs) in combination with utilization information. Today, handling such dynamic inconsistencies in target computational devices causes workload efficiency to suffer, which further causes client expectation problems. Examples disclosed herein also support dynamic hybrid combinations. For example, in one instance a workload may only run on a CPU or a GPU, but at a second instance the workload may be structured to run on both, or on various combinations of computational resources.
Example improvements disclosed herein develop workload optimizations for heterogeneous environments (e.g., a first edge platform with a CPU, a second edge platform with a GPU, a third edge platform with a combination of CPU and FPGA, etc.) in a manner that considers client SLA parameters and utilization parameters. In some examples disclosed herein, optimization for all devices of a current/known platform occurs prior to runtime (e.g., an inference phase) to allow dynamic switching of one or more portions of the workload (e.g., selection of different optimized graphs). In some examples disclosed herein, dynamic switching of one or more portions of the workload can be directed to alternate ones of the available heterogeneous devices of the Edge network. In some examples, different available resources may be located at any location within the example Edge cloud A110 described above. In some examples, available resources reside at a far edge during a first time and due to, for example, changing demands of the far edge resources, remaining resources become limited to near edge locations within the Edge cloud A110. Examples disclosed herein accommodate circumstances where workload requirements and corresponding choices for accelerators (e.g., based on need) are dynamic.
Examples disclosed herein consider any number and/or type of workload, such as AI algorithms, connected graphs, and/or other algorithms stitched together to accomplish a relatively larger task objective(s). For instance, when optimizing a ResNet50 neural network, examples disclosed herein identify whether particular layers are more suited to run on particular target devices, such as a CPU rather than a GPU.
FIG. ID5_A illustrates an example framework ID5_A100 to optimize a workload. In the illustrated example of FIG. ID5_A, the framework ID5_A100 includes an example workload or pool of workloads ID5_A102 and example platform resources ID5_A104 that are known to be available at a given moment. The example platform resources ID5_A104 include any number and/or type of devices capable of performing workload tasks including, but not limited to, CPUs, GPUs, FPGAs, ASICs, accelerators, etc. As described in further detail below, the workload ID5_A102 is optimized in view of the respective resources ID5_A104 to generate optimized graphs ID5_A106 corresponding to each resource. In some examples, an optimized graph is represented as a neural architecture having a particular number of layers, connections (nodes), weights and/or hyperparameters. Example weights disclosed herein include one or more values represented by alphanumeric characters. Such values may be stored in one or more data structures, in which example data structures include integers, floating point representations and/or characters. Weights and corresponding values to represent such weights represent data stored in any manner. Such data may also propagate from a first data structure to a second or any number of subsequent data structures along a data path, such as a bus. An aggregate of the optimized graphs ID5_A106 is consolidated as a union of graphs ID5_A108 and packaged within the workload ID5_A102 to create a packaged workload ID5_A110.
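One way to picture the packaged workload is as a container that carries one optimized graph per platform resource. The following Python sketch assumes hypothetical class names, field names, and a hypothetical `package` helper; the disclosure does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class OptimizedGraph:
    device: str                # e.g., "CPU", "GPU", "FPGA"
    layers: Tuple[str, ...]    # ordered layer descriptions for this device
    est_latency_ms: float      # hypothetical benchmark figure


@dataclass
class PackagedWorkload:
    name: str
    # The union of graphs: one optimized graph per candidate device.
    union_of_graphs: Dict[str, OptimizedGraph] = field(default_factory=dict)


def package(name: str, graphs: Iterable[OptimizedGraph]) -> PackagedWorkload:
    """Embed the per-device optimized graphs into a single packaged workload."""
    workload = PackagedWorkload(name)
    for graph in graphs:
        workload.union_of_graphs[graph.device] = graph
    return workload
```

Keying the union by device lets a runtime component look up an already optimized alternative without recomputing it.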
At least one benefit of the packaged workload ID5_A110 is that examples disclosed herein include and/or otherwise embed additional semantic information into the workload so that on-the-fly decisions can occur in view of dynamic conditions during runtime ID5_A112. Examples disclosed herein retrieve, receive and/or otherwise obtain SLA information/parameters ID5_A114 and current utilization information ID5_A116. As used herein, SLA parameters represent constraints to be satisfied by workload execution, such as accuracy metrics, speed metrics, power consumption metrics, cost (e.g., financial cost, processor burden cost, etc.) metrics, etc. As used herein, utilization parameters represent real-time operating conditions of platforms and/or underlying computing resources thereof that are executing the workload and/or operating conditions of candidate platforms that could be considered for the workload in the future.
Dynamic conditions include, but are not limited to, changing conditions of underlying hardware (and/or underlying allocated VMs), changing conditions of hardware characteristics (e.g., multiple tenant use versus single tenant use, cost of utilization of the underlying resources), and/or changing conditions of a Service Level Agreement (SLA). In some examples, a client with a workload to be executed and/or otherwise processed by computing resources is forced to move the workload to one or more alternate or additional resources. For instance, a current computing resource (e.g., one or more of the example resources ID5_A104) may become unavailable, a current computing resource financial cost may exceed one or more SLA thresholds established by the client, and/or a current computing resource may be inundated by requests from one or more other tenants. However, in response to such dynamic possibilities, examples disclosed herein enable prompt identification of which computing resources should handle the workload(s) and which graph(s) to employ with the workload(s) in view of available computing resources, and/or prompt invocation of SLA renegotiation efforts.
Existing optimization techniques consider predetermined platforms corresponding to predetermined workloads and their graphs. However, examples disclosed herein acknowledge that different workload graphs can achieve workload objectives with comparable performance (e.g., speed). For instance, optimizations for ResNet50 include different types of graph configurations based on the underlying computational devices that will execute the workload. Example graph configurations may include different layer structure arrangements (e.g., in the event a convolutional neural network (CNN) is used), such as a first layer (e.g., 7×7 Conv layer) connected to a second layer (e.g., 5×5 Conv layer) connected to a third layer . . . etc. A graph configuration for a particular target computational device may be referred to as a path. However, an alternate target computational device may reveal an optimized graph configuration (e.g., a second path) that is different for the same workload (e.g., two 7×7 Conv layers connected). The example first and second paths may accomplish the workload objective and may even have substantially the same performance (e.g., efficiency, speed, power consumption). In some examples, the first and second paths may accomplish the workload objective with substantially similar performance in some respects (e.g., efficiency, speed, power consumption) and substantially different performance in other respects (e.g., cost (e.g., financial, processor burden), latency). Examples disclosed herein optionally expose one or more knobs to facilitate selection and/or adjustment of selectable options. 
Knobs include, but are not limited to, particular target hardware device preferences (e.g., CPU, GPU, FPGA, accelerator, particular CPU core selection(s), uncore frequencies, memory bandwidth, cache partitioning/reservation, FPGA RTL partitioning, GPU execution units partitioning/reservation, etc.), and particular optimization parameter preferences (e.g., improved latency, improved energy consumption, improved accuracy, etc.). Knobs may be selected by, for example, a user and/or an agent. In some examples, knobs are selected, added, removed, etc. via an interface (e.g., a user interface, a graphical user interface (GUI)). In some examples, the interface renders, informs and/or otherwise displays current knobs and their corresponding values, in which alternate knobs and/or corresponding values can be augmented (e.g., selected by a user). Agent knob adjustment may occur in an automatic manner independent of the user in an effort to identify one or more particular optimized knob settings.
However, in the event of dynamic utilization parameters, changing conditions may cause one of these paths to deteriorate and/or otherwise fail to meet performance expectations. In other examples, changing conditions may have no effect on the efficacy and/or efficiency of workload performance, but might violate the SLA parameters (e.g., the workload costs too much to execute on the target computational device, the workload consumes too much power, etc.). Examples disclosed herein accommodate such dynamic conditions in heterogeneous environments. FIG. ID5_B illustrates example graph semantic embedding for a graph of interest ID5_B200. In the illustrated example of FIG. ID5_B, each node and each subgraph on a compute device (e.g., CPU, GPU, XPU, FPGA, etc.) corresponds to a particular cost (e.g., latency). Information related to nodes and subgraphs is collected and analyzed (e.g., type of compute (vector, systolic, etc.), input/output data size, etc.). Different embeddings capture corresponding probabilities of certain paths in a graph towards the accuracy of the network. Based on the available information, dynamic decisions are made to schedule part of a subgraph on different available compute devices in a platform. Optimized graphs ID5_B200 disclosed herein consider more than just a generic parameter in view of a predetermined computing device; they also incorporate semantic information that can trigger dynamic decisions of (a) candidate target computing devices to best handle the workload and (b) candidate graphs to best facilitate workload operation in view of dynamic SLA information, dynamic computing resource conditions, and key performance indicator (KPI) information. For instance, in response to a need (e.g., perhaps an unexpected need) to seek workload execution (e.g., an AI algorithm, ResNet50, etc.) from a cloud (e.g., a Google® Cloud platform, Amazon Web Services® Cloud platform), an attach point may include one or more computing devices different than what may have been used on prior occasions. A first or prior occasion of workload execution may have been in view of first SLA parameters/criteria, in which a financial cost of using such computing hardware was relatively low. In the event of changing SLA criteria and/or in the event of changing demand on the current computing hardware, examples disclosed herein adapt an applied model (e.g., optimized graphs). A first Cloud service provider (e.g., Google®) may have raised its prices for usage privileges, but a second Cloud service provider (e.g., AWS®) may have maintained or reduced its costs. Examples disclosed herein may partition workloads in view of these changing conditions.
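The idea that each graph node carries a per-device cost, and that parts of a graph can be scheduled on whichever devices are currently available, might be sketched as follows. This is a deliberately simplified, hypothetical illustration: the node names and latency figures are invented, and the greedy per-node choice ignores data transfer costs between devices, which a real scheduler would need to consider.

```python
from typing import Dict, Set

# Hypothetical per-node latency (ms) on each device that can execute the node.
NODE_LATENCY_MS: Dict[str, Dict[str, float]] = {
    "conv1": {"CPU": 4.0, "GPU": 1.0},
    "pool1": {"CPU": 0.5, "GPU": 0.8},
    "fc": {"CPU": 2.0, "GPU": 1.5, "FPGA": 1.2},
}


def schedule(node_latency: Dict[str, Dict[str, float]], available: Set[str]) -> Dict[str, str]:
    """Place each node on the available device with the lowest recorded latency."""
    placement = {}
    for node, costs in node_latency.items():
        candidates = {dev: ms for dev, ms in costs.items() if dev in available}
        if not candidates:
            raise ValueError(f"no available device can execute node {node!r}")
        placement[node] = min(candidates, key=candidates.get)
    return placement


def total_latency_ms(node_latency: Dict[str, Dict[str, float]], placement: Dict[str, str]) -> float:
    """Sum the per-node latency of a placement (sequential execution assumed)."""
    return sum(node_latency[node][dev] for node, dev in placement.items())
```

Calling `schedule` again with a smaller `available` set models a device dropping out at runtime: the placement changes without recomputing any of the stored per-node costs.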
FIG. ID5_C illustrates example optimizing circuitry ID5_C300 to optimize workloads in a heterogeneous environment. In the illustrated example of FIG. ID5_C, the optimizing circuitry ID5_C300 includes example benchmark managing circuitry ID5_C302, example SLA managing circuitry ID5_C304, example hyper parameter tuning circuitry ID5_C306, example reconfiguration managing circuitry ID5_C308, example agent managing circuitry ID5_C310, and example workload activity detecting circuitry ID5_C312.
In some examples, the optimizing circuitry ID5_C300 includes means for managing benchmarks, means for managing SLAs, means for tuning hyperparameters, means for managing reconfigurations, means for managing agents, and means for detecting workload activity. For example, the means for managing benchmarks may be implemented by the benchmark managing circuitry ID5_C302, the means for managing SLAs may be implemented by the SLA managing circuitry ID5_C304, the means for tuning hyperparameters may be implemented by the hyper parameter tuning circuitry ID5_C306, the means for managing reconfigurations may be implemented by the reconfiguration managing circuitry ID5_C308, the means for managing agents may be implemented by the agent managing circuitry ID5_C310, and the means for detecting workload activity may be implemented by the workload activity detecting circuitry ID5_C312. In some examples, the benchmark managing circuitry ID5_C302, the SLA managing circuitry ID5_C304, the hyper parameter tuning circuitry ID5_C306, the reconfiguration managing circuitry ID5_C308, the agent managing circuitry ID5_C310 and/or the workload activity detecting circuitry ID5_C312 may be implemented by machine executable instructions disclosed herein and executed by processor circuitry, which may be implemented by the example processor circuitry D212 of
In operation, the example reconfiguration managing circuitry ID5_C308 determines whether a workload has been detected for which optimization efforts have not yet occurred. If not, the example reconfiguration managing circuitry ID5_C308 continues to wait for such an instance. However, in response to the example reconfiguration managing circuitry ID5_C308 detecting a workload to be analyzed, the example agent managing circuitry ID5_C310 invokes a workload agent to be associated and/or otherwise assigned with the workload evaluation (e.g., training). In some examples, the assigned workload agent is a reinforcement learning agent to perform exploration in view of a cost function. The example reconfiguration managing circuitry ID5_C308 identifies candidate hardware resources, such as those communicatively connected via an Edge network, and stores such candidate resources in a storage (e.g., a database, memory) for later reference and consideration.
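To illustrate what exploration in view of a cost function can look like, the following Python sketch uses an epsilon-greedy bandit, which is one of the simplest reinforcement learning strategies. It is a stand-in chosen for brevity and is not asserted to be the agent described here; the function name, parameters, and defaults are all hypothetical.

```python
import random


def explore(candidates, cost_fn, episodes=100, epsilon=0.1, seed=0):
    """Return the candidate resource with the lowest observed mean cost.

    candidates: list of resource identifiers.
    cost_fn: callable returning a numeric cost for running on a resource.
    """
    if not candidates:
        raise ValueError("at least one candidate resource is required")
    rng = random.Random(seed)  # seeded so repeated runs behave identically
    totals = {c: 0.0 for c in candidates}
    counts = {c: 0 for c in candidates}

    def mean_cost(c):
        return totals[c] / counts[c]

    # Try every candidate once so each one has a cost estimate.
    for c in candidates:
        totals[c] += cost_fn(c)
        counts[c] += 1

    for _ in range(episodes):
        if rng.random() < epsilon:
            choice = rng.choice(candidates)          # explore a random resource
        else:
            choice = min(candidates, key=mean_cost)  # exploit the best so far
        totals[choice] += cost_fn(choice)
        counts[choice] += 1

    return min(candidates, key=mean_cost)
```

In practice the cost function would combine measured quantities such as latency, power, or financial cost; here it is left as a caller-supplied callable.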
In connection with available hardware resources identified by the example reconfiguration managing circuitry ID5_C308 (block ID5_D106) and/or such resources stored in the storage, the example agent managing circuitry ID5_C310 calculates and/or otherwise determines optimizations. As disclosed above, optimizations may be represented as models or graphs, such as the example graph of interest ID5_B200. In particular, the example agent managing circuitry ID5_C310 selects a candidate resource from the resource list and the example SLA managing circuitry ID5_C304 retrieves current SLA information associated with the workload. The example hyper parameter tuning circuitry ID5_C306 calculates an optimized graph for the selected resource. In some examples, the hyper parameter tuning circuitry ID5_C306 applies a reinforcement learning model for the assigned agent to process, in which a cost function is evaluated in view of one or more parameters corresponding to the SLA information. Optimized graphs are calculated for all available candidate computing resources, and the example benchmark managing circuitry ID5_C302 packages and/or otherwise embeds the optimization metrics as a union of graphs. Further, the example benchmark managing circuitry ID5_C302 attaches/embeds the union of graphs to the workload so that dynamic decisions may occur in real time during an inference/runtime phase of the workload.
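The pre-runtime loop described above (select a resource, evaluate a cost function in view of SLA parameters, keep the best graph, and collect the results as a union of graphs) might be sketched as follows. The function names, the weighted-sum cost, and the dictionary layout are assumptions made for illustration; the disclosure leaves the cost function and the optimizer open.

```python
from typing import Dict

Metrics = Dict[str, float]


def sla_cost(metrics: Metrics, sla_weights: Metrics) -> float:
    """Weighted sum of metrics; metrics without a weight contribute nothing."""
    return sum(sla_weights.get(name, 0.0) * value for name, value in metrics.items())


def build_union_of_graphs(
    candidate_graphs: Dict[str, Dict[str, Metrics]],
    sla_weights: Metrics,
) -> Dict[str, dict]:
    """For every resource, keep the candidate graph with the lowest SLA cost.

    candidate_graphs maps resource -> {graph name -> measured metrics}.
    """
    union = {}
    for resource, options in candidate_graphs.items():
        if not options:
            continue  # nothing to choose from for this resource
        best = min(options, key=lambda name: sla_cost(options[name], sla_weights))
        union[resource] = {"graph": best, "metrics": options[best]}
    return union
```

Storing the metrics next to each chosen graph is what allows a later runtime check to compare alternatives without re-benchmarking them.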
During a runtime phase, the example workload activity detecting circuitry ID5_C312 monitors a platform to determine whether a labelled workload (e.g., a workload containing a union of graphs) has been invoked. If so, the example SLA managing circuitry ID5_C304 retrieves current SLA information. While SLA information was disclosed above as being previously retrieved, examples disclosed herein acknowledge and address the fact that SLA information may change from time to time depending on, for example, client needs, budget, etc. The example reconfiguration managing circuitry ID5_C308 retrieves current utilization information for the computing resources associated with the above-identified workload invocation. In some examples, utilization information is obtained with the aid of Intel® Resource Director Technology (RDT). Resource information may include, but is not limited to, resource availability, current resource utilization (e.g., in view of multiple tenant utilization), and current resource cost (e.g., a dollar-per-cycle cost).
The example SLA managing circuitry ID5_C304 determines whether the currently identified computing resources will satisfy the current SLA parameters and, if so, no further model adjustments are needed. However, in the event of deviations from the SLA parameters, the example reconfiguration managing circuitry ID5_C308 selects an alternate path (e.g., alternate graph) that exhibits predicted SLA compliance to a threshold margin. Such selections have reduced computational requirements because, in part, examples disclosed herein include semantic information that identifies and/or otherwise reveals alternative paths that have already been calculated to achieve desired results. As such, alternative path selection occurs in a relatively faster manner with less computational burden when compared to traditional techniques. Considering that one or more conditions have changed, the example agent managing circuitry ID5_C310 assigns another agent to re-assess performance of the selected alternate path. The example benchmark managing circuitry ID5_C302 updates the workload with new information corresponding to the newly selected path and the current conditions. The updated workload information includes updated semantic information that forms a part of the union of graphs.
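The runtime decision described above can be sketched as a lookup over precomputed paths: keep the current path if it satisfies the SLA, otherwise pick an alternate whose predicted metrics stay within the SLA limits by a margin. The function name, the metric names, the "every metric is an upper limit" convention, and the tie-breaking rule are all assumptions made for this illustration.

```python
from typing import Dict, Optional

Metrics = Dict[str, float]


def select_path(
    current: str,
    paths: Dict[str, Metrics],
    sla_limits: Metrics,
    margin: float = 0.1,
) -> Optional[str]:
    """Return the path to use, or None if no precomputed path is compliant.

    Every SLA limit is treated as an upper bound on the matching metric.
    """
    def compliant(metrics: Metrics, slack: float) -> bool:
        return all(metrics[name] <= limit * (1.0 - slack) for name, limit in sla_limits.items())

    def worst_ratio(metrics: Metrics) -> float:
        return max(metrics[name] / limit for name, limit in sla_limits.items())

    if compliant(paths[current], 0.0):
        return current  # no model adjustment needed
    alternates = [p for p in paths if p != current and compliant(paths[p], margin)]
    if not alternates:
        return None     # signals that SLA renegotiation may be required
    # Prefer the alternate with the most headroom on its tightest metric.
    return min(alternates, key=lambda p: worst_ratio(paths[p]))
```

Because the candidate paths and their metrics were computed ahead of time, this selection is a handful of comparisons rather than a new optimization run.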
In some examples, SLA performance objectives cannot be met in view of the candidate computing resources available at the current time. In such circumstances, the example benchmark managing circuitry ID5_C302 attempts to renegotiate SLA parameters between competing tenants. Micropayments are provided by the benchmark managing circuitry ID5_C302 to particular tenants as compensation for not meeting the SLA parameter requirements to a threshold degree (e.g., when insufficient SLA requirements are detected). In some examples, the benchmark managing circuitry ID5_C302 provides such micropayments and subsequently moves the affected workload(s) to alternate computing resources to complete workload objectives, sometimes at a reduced performance (e.g., slower). In some examples, the benchmark managing circuitry ID5_C302 allocates micropayments to a first tenant that agrees to relinquish a portion of available resources to a second tenant. As such, the first tenant does not consume that portion of resources so that the second tenant can utilize such resources to accomplish computing tasks. Micropayments provided to and/or otherwise allocated to the first tenant include access to one or more portions of available resources at a subsequent time. In some examples, the micropayments to the first tenant represent a portion of workload resources that is greater than those originally provided to the first tenant. In some examples, the micropayments to the first tenant reflect a quantity of computing cycles corresponding to one or more edge network devices. In some examples, the second tenant receives and/or otherwise obtains the use of computing resources having a reduced latency and the first tenant receives and/or otherwise obtains the use of computing resources having a relatively longer latency.
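A micropayment denominated in computing cycles might be tracked with a simple ledger, as in the following Python sketch. The `Tenant` class, the `relinquish` function, and the 25% bonus ratio are hypothetical; the sketch interprets the credit as exceeding the relinquished amount, which is one possible reading of the description.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    allocated_cycles: int   # cycles the tenant may consume now
    credit_cycles: int = 0  # cycles owed to the tenant at a subsequent time


def relinquish(first: Tenant, second: Tenant, cycles: int, bonus_ratio: float = 0.25) -> None:
    """Move cycles from the first tenant to the second and credit the first.

    The credit exceeds the relinquished amount by bonus_ratio (an assumed value).
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    if cycles > first.allocated_cycles:
        raise ValueError("tenant cannot relinquish more than it holds")
    first.allocated_cycles -= cycles
    second.allocated_cycles += cycles
    first.credit_cycles += int(cycles * (1.0 + bonus_ratio))
```

The credit is recorded rather than granted immediately, mirroring the description of access to resources at a subsequent time.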
While an example manner of implementing the optimizing circuitry ID5_C300 of FIG. ID5_C is illustrated in FIG. ID5_C, one or more of the elements, processes and/or devices illustrated in FIG. ID5_C may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example benchmark managing circuitry ID5_C302, the example SLA managing circuitry ID5_C304, the example hyper parameter tuning circuitry ID5_C306, the example reconfiguration managing circuitry ID5_C308, the example agent managing circuitry ID5_C310, the example workload activity detecting circuitry ID5_C312 and/or, more generally, the example optimizing circuitry ID5_C300 of FIG. ID5_C may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Example hardware implementations include implementation on the example compute circuitry D102 (e.g., the example processor D104) of
Flowcharts representative of example hardware logic, machine-readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the optimizing circuitry ID5_C300 of FIG. ID5_C are shown in FIGS. ID5_D through ID5_F. The machine-readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor D152 shown in the example processor platform D150 discussed above in connection with
The machine-readable instructions described throughout this disclosure may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine-readable instructions as described herein may be stored as data or a data structure (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine-readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine-readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine-readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement one or more functions that may together form a program such as that described herein.
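For purposes of illustration only, the storage of instructions in multiple individually compressed parts, and their later recombination into an executable form, may be sketched as follows. The fragment and reassemble functions are hypothetical names, the list of parts stands in for separate storage devices, and encryption is omitted for brevity; none of these choices is required by the examples disclosed herein.

```python
import zlib
from typing import List


def fragment(source: str, parts: int) -> List[bytes]:
    """Split the instructions into roughly `parts` pieces and compress each piece."""
    data = source.encode("utf-8")
    size = max(1, -(-len(data) // parts))  # ceiling division, at least one byte per piece
    return [zlib.compress(data[i:i + size]) for i in range(0, len(data), size)]


def reassemble(fragments: List[bytes]) -> str:
    """Decompress every piece and combine the pieces in their original order."""
    return b"".join(zlib.decompress(piece) for piece in fragments).decode("utf-8")


program = "def add(a, b):\n    return a + b\n"
stored = fragment(program, parts=3)   # each entry could reside on a different device
restored = reassemble(stored)

# Only after recombination are the instructions directly executable.
namespace = {}
exec(compile(restored, "<reassembled>", "exec"), namespace)
result = namespace["add"](2, 3)
```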
In another example, the machine-readable instructions disclosed throughout this disclosure may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. to execute the instructions on a particular computing device or other device. In another example, the machine-readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine-readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine-readable media, as used herein, may include machine-readable instructions and/or program(s) regardless of the particular format or state of the machine-readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine-readable instructions described throughout this document can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine-readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example processes of FIGS. ID5_D through ID5_F may be implemented using executable instructions (e.g., computer and/or machine-readable instructions) stored on a non-transitory computer and/or machine-readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the term “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. 
Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
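By way of illustration, the seven combinations covered by the form A, B, and/or C can be enumerated as the non-empty subsets of the three items. The and_or function below is a hypothetical helper assumed for this sketch.

```python
from itertools import combinations


def and_or(items):
    """Return every non-empty combination of the items, smallest combinations first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


subsets = and_or(["A", "B", "C"])
```

The resulting list follows the same order as combinations (1) through (7) recited above.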
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” entity, as used herein, refers to one or more of that entity. The terms “a” (or “an”), “one or more”, and “at least one” can be used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., a single unit or processor. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
The program ID5_D100 of FIG. ID5_D includes block ID5_D102, where the example reconfiguration managing circuitry ID5_C308 determines whether a workload has been detected for which optimization efforts have not yet occurred. If not, the example reconfiguration managing circuitry ID5_C308 continues to wait for such an instance. However, in response to the example reconfiguration managing circuitry ID5_C308 detecting a workload to be analyzed (block ID5_D102), the example agent managing circuitry ID5_C310 invokes a workload agent to be associated and/or otherwise assigned with the workload evaluation (e.g., training) (block ID5_D104). The example reconfiguration managing circuitry ID5_C308 identifies candidate hardware resources (block ID5_D106), such as those communicatively connected via an Edge network, and stores such candidate resources in a storage (e.g., a database, memory) (block ID5_D108) for later reference and consideration.
In connection with available hardware resources identified by the example reconfiguration managing circuitry ID5_C308 (block ID5_D106) and/or such resources stored in the storage, the example agent managing circuitry ID5_C310 calculates and/or otherwise determines optimizations (block ID5_D110). Example FIG. ID5_E discloses further detail in connection with calculating optimizations of block ID5_D110. In the illustrated example of FIG. ID5_E, the example agent managing circuitry ID5_C310 selects a candidate resource from the resource list (block ID5_E202) and the example SLA managing circuitry ID5_C304 retrieves current SLA information associated with the workload (block ID5_E204). The example hyper parameter tuning circuitry ID5_C306 calculates an optimized graph for the selected resource (block ID5_E206). The example agent managing circuitry ID5_C310 determines whether there is another candidate resource to consider for optimization calculations (block ID5_E208) and, if so, control returns to block ID5_E202. Otherwise, control returns to the illustrated example of FIG. ID5_D.
Returning to FIG. ID5_D, the example benchmark managing circuitry ID5_C302 packages the optimization metrics as a union of graphs (block ID5_D112), and the example benchmark managing circuitry ID5_C302 attaches the union of graphs to the workload (block ID5_D114) so that dynamic decisions may occur in real time during an inference/runtime phase of the workload.
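The flow of FIGS. ID5_D and ID5_E may be sketched, under simplifying assumptions, as a loop over candidate resources that produces one graph per resource and attaches the resulting union of graphs to the workload. All class and function names below are hypothetical, and the latency estimate in optimize_for is a stand-in for the hyper parameter tuning of block ID5_E206, not the tuning technique itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    name: str
    ops_per_ms: float  # assumed throughput figure for the device


@dataclass
class Graph:
    resource: str
    predicted_latency_ms: float
    meets_sla: bool


@dataclass
class Workload:
    name: str
    ops: float  # assumed amount of work per invocation
    union_of_graphs: Dict[str, Graph] = field(default_factory=dict)


def optimize_for(workload: Workload, resource: Resource, sla_max_latency_ms: float) -> Graph:
    # Placeholder for block ID5_E206: estimate latency on the selected resource.
    latency = workload.ops / resource.ops_per_ms
    return Graph(resource.name, latency, latency <= sla_max_latency_ms)


def label_workload(workload: Workload, resources: List[Resource], sla_max_latency_ms: float) -> Workload:
    graphs = {}
    for resource in resources:  # blocks ID5_E202 and ID5_E208: visit each candidate
        graphs[resource.name] = optimize_for(workload, resource, sla_max_latency_ms)
    workload.union_of_graphs = graphs  # blocks ID5_D112 and ID5_D114: package and attach
    return workload


labelled = label_workload(
    Workload("detector", ops=1000.0),
    [Resource("cpu", ops_per_ms=10.0), Resource("gpu", ops_per_ms=50.0)],
    sla_max_latency_ms=50.0,
)
```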
FIG. ID5_F illustrates an example program ID5_F300 to optimize a workload during runtime conditions. In the illustrated example of FIG. ID5_F, the example workload activity detecting circuitry ID5_C312 monitors a platform for whether a labelled workload (e.g., a workload containing a union of graphs) has been invoked (block ID5_F302). If so, the example SLA managing circuitry ID5_C304 retrieves current SLA information (block ID5_F304). The example reconfiguration managing circuitry ID5_C308 retrieves current utilization information for the computing resources associated with the above-identified workload invocation (block ID5_F306). The example SLA managing circuitry ID5_C304 determines whether the currently identified computing resources will satisfy the current SLA parameters (block ID5_F308) and, if so, no further model adjustments are needed and control returns to block ID5_F302. However, in the event of deviations from the SLA parameters, the example reconfiguration managing circuitry ID5_C308 selects an alternate path (e.g., alternate graph) that exhibits predicted SLA compliance to a threshold margin (block ID5_F310). Considering that one or more conditions have changed, the example agent managing circuitry ID5_C310 assigns another agent to re-assess performance of the selected alternate path (block ID5_F312). The example benchmark managing circuitry ID5_C302 updates the workload with new information corresponding to the newly selected path and the current conditions (block ID5_F314).
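One possible sketch of the runtime decision of blocks ID5_F308 and ID5_F310 follows. The function names, the margin parameter and, in particular, the formula that scales a predicted latency by current utilization are assumptions made for illustration; the disclosure does not specify how utilization is folded into the prediction.

```python
from typing import Dict, Optional


def effective_latency(predicted_ms: float, utilization: float) -> float:
    # Assumed model: a resource that is busy a fraction `utilization` of the time
    # delivers the predicted latency stretched by 1 / (1 - utilization).
    return predicted_ms / max(1e-9, 1.0 - utilization)


def select_path(
    graphs: Dict[str, float],        # path name -> predicted latency in ms
    utilization: Dict[str, float],   # path name -> current utilization in [0, 1)
    sla_max_latency_ms: float,
    current: str,
    margin: float = 0.1,
) -> Optional[str]:
    # Block ID5_F308: keep the current path if it still satisfies the SLA.
    if effective_latency(graphs[current], utilization[current]) <= sla_max_latency_ms:
        return current
    # Block ID5_F310: pick an alternate path predicted to comply with a threshold margin.
    budget = sla_max_latency_ms * (1.0 - margin)
    compliant = {}
    for name, predicted in graphs.items():
        if name == current:
            continue
        latency = effective_latency(predicted, utilization[name])
        if latency <= budget:
            compliant[name] = latency
    if not compliant:
        return None  # no alternate path is predicted to comply
    return min(compliant, key=compliant.get)


graphs = {"cpu": 40.0, "gpu": 10.0, "fpga": 20.0}
busy = {"cpu": 0.5, "gpu": 0.9, "fpga": 0.25}
chosen = select_path(graphs, busy, sla_max_latency_ms=50.0, current="cpu")
```

With these illustrative numbers the heavily loaded CPU and GPU paths both miss the 50 ms objective, so the sketch selects the FPGA path.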
Example methods, apparatus, systems, and articles of manufacture to optimize resources in Edge networks are disclosed herein. Further examples and combinations thereof include the following: Example 67 includes an apparatus comprising at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus, a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations, or Application Specific Integrated Circuitry including logic gate circuitry to perform one or more third operations, the processor circuitry to at least one of perform at least one of the first operations, the second operations or the third operations to invoke an exploration agent to identify platform resource devices, select a first one of the identified platform resource devices, generate first optimization metrics for the workload corresponding to the first one of the identified platform resource devices, the first optimization metrics corresponding to a first path, embed first semantic information to the workload, the first semantic information including optimized graph information and platform structure information corresponding to the first one of the identified platform resource devices, select a second one of the identified platform resource devices, generate second optimization metrics for the workload corresponding to the second one of the identified platform resource devices, the second optimization metrics corresponding to a second path, embed second semantic information to the workload, the second semantic information including optimized graph information and platform structure information corresponding to the second one of the identified platform resource devices, and select the first path or the second path during runtime based on (a) service level agreement (SLA) information and (b) utilization information corresponding to the first and second identified platform resource devices.
Example 68 includes the apparatus as defined in example 67, wherein the processor circuitry is to determine a utilization deviation corresponding to the first and second ones of the platform resource devices, compare the utilization deviation with the SLA information, and migrate the workload to one of the first path or the second path to satisfy at least one threshold of the SLA information.
Example 69 includes the apparatus as defined in example 67, wherein the SLA information includes at least one of latency metrics, power consumption metrics, resource cost metrics, or accuracy metrics.
Example 70 includes the apparatus as defined in example 67, wherein the processor circuitry is to determine a quantity of tenants participating with the identified platform resource devices when generating the first optimization metrics and the second optimization metrics.
Example 71 includes the apparatus as defined in example 70, wherein the processor circuitry is to embed information corresponding to the quantity of tenants with the first and second semantic information.
Example 72 includes the apparatus as defined in example 67, wherein the processor circuitry is to provide micropayments to a tenant in response to insufficient SLA requirements.
Example 73 includes at least one non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least invoke an exploration agent to identify platform resource devices, select a first one of the identified platform resource devices, generate first optimization metrics for the workload corresponding to the first one of the identified platform resource devices, the first optimization metrics corresponding to a first path, embed first semantic information to the workload, the first semantic information including optimized graph information and platform structure information corresponding to the first one of the identified platform resource devices, select a second one of the identified platform resource devices, generate second optimization metrics for the workload corresponding to the second one of the identified platform resource devices, the second optimization metrics corresponding to a second path, embed second semantic information to the workload, the second semantic information including optimized graph information and platform structure information corresponding to the second one of the identified platform resource devices, and select the first path or the second path during runtime based on (a) service level agreement (SLA) information and (b) utilization information corresponding to the first and second identified platform resource devices.
Example 74 includes the at least one computer readable storage medium as defined in example 73, wherein the instructions, when executed, cause the at least one processor to determine a utilization deviation corresponding to the first and second ones of the platform resource devices, compare the utilization deviation with the SLA information, and migrate the workload to one of the first path or the second path to satisfy at least one threshold of the SLA information.
Example 75 includes the at least one computer readable storage medium as defined in example 73, wherein the SLA information includes at least one of latency metrics, power consumption metrics, resource cost metrics, or accuracy metrics.
Example 76 includes the at least one computer readable storage medium as defined in example 73, wherein the instructions, when executed, cause the at least one processor to determine a quantity of tenants participating with the identified platform resource devices when generating the first optimization metrics and the second optimization metrics.
Example 77 includes the at least one computer readable storage medium as defined in example 76, wherein the instructions, when executed, cause the at least one processor to embed information corresponding to the quantity of tenants with the first and second semantic information.
Example 78 includes the at least one computer readable storage medium as defined in example 73, wherein the instructions, when executed, cause the at least one processor to provide micropayments to a tenant in response to insufficient SLA criteria.
Example 79 includes a method to optimize a workload, the method comprising invoking an exploration agent to identify platform resource devices, selecting a first one of the identified platform resource devices, generating first optimization metrics for the workload corresponding to the first one of the identified platform resource devices, the first optimization metrics corresponding to a first path, embedding first semantic information to the workload, the first semantic information including optimized graph information and platform structure information corresponding to the first one of the identified platform resource devices, selecting a second one of the identified platform resource devices, generating second optimization metrics for the workload corresponding to the second one of the identified platform resource devices, the second optimization metrics corresponding to a second path, embedding second semantic information to the workload, the second semantic information including optimized graph information and platform structure information corresponding to the second one of the identified platform resource devices, and selecting the first path or the second path during runtime based on (a) service level agreement (SLA) information and (b) utilization information corresponding to the first and second identified platform resource devices.
Example 80 includes the method as defined in example 79, further including determining a utilization deviation corresponding to the first and second ones of the platform resource devices, comparing the utilization deviation with the SLA information, and migrating the workload to one of the first path or the second path to satisfy at least one threshold of the SLA information.
Example 81 includes the method as defined in example 79, wherein the SLA information includes at least one of latency metrics, power consumption metrics, resource cost metrics, or accuracy metrics.
Example 82 includes the method as defined in example 79, wherein generating the first optimization metrics and the second optimization metrics further includes determining a quantity of tenants participating with the identified platform resource devices.
Example 83 includes the method as defined in example 82, further including embedding information corresponding to the quantity of tenants with the first and second semantic information.
Example 84 includes the method as defined in example 79, further including providing micropayments to a tenant in response to detecting insufficient SLA parameters.
Example 85 includes an apparatus comprising agent managing circuitry to invoke an exploration agent to identify platform resource devices, select a first one of the identified platform resource devices, generate first optimization metrics for the workload corresponding to the first one of the identified platform resource devices, the first optimization metrics corresponding to a first path, and select a second one of the identified platform resource devices, and generate second optimization metrics for the workload corresponding to the second one of the identified platform resource devices, the second optimization metrics corresponding to a second path, benchmark managing circuitry to embed second semantic information to the workload, the second semantic information including optimized graph information and platform structure information corresponding to the second one of the identified platform resource devices, and reconfiguration managing circuitry to select the first path or the second path during runtime based on (a) service level agreement (SLA) information and (b) utilization information corresponding to the first and second identified platform resource devices.
Example 86 includes the apparatus as defined in example 85, wherein the reconfiguration managing circuitry is to determine a utilization deviation corresponding to the first and second ones of the platform resource devices, compare the utilization deviation with the SLA information, and migrate the workload to one of the first path or the second path to satisfy at least one threshold of the SLA information.
Example 87 includes the apparatus as defined in example 85, wherein the SLA information includes at least one of latency metrics, power consumption metrics, resource cost metrics, or accuracy metrics.
Example 88 includes the apparatus as defined in example 85, wherein generating the first optimization metrics and the second optimization metrics further includes determining a quantity of tenants participating with the identified platform resource devices.
Example 89 includes the apparatus as defined in example 88, wherein the benchmark managing circuitry is to embed information corresponding to the quantity of tenants with the first and second semantic information.
Example 90 includes a system comprising means for managing agents to invoke an exploration agent to identify platform resource devices, select a first one of the identified platform resource devices, generate first optimization metrics for the workload corresponding to the first one of the identified platform resource devices, the first optimization metrics corresponding to a first path, select a second one of the identified platform resource devices, and generate second optimization metrics for the workload corresponding to the second one of the identified platform resource devices, the second optimization metrics corresponding to a second path, means for managing benchmarks to embed second semantic information to the workload, the second semantic information including optimized graph information and platform structure information corresponding to the second one of the identified platform resource devices, and means for managing reconfigurations to select the first path or the second path during runtime based on (a) service level agreement (SLA) information and (b) utilization information corresponding to the first and second identified platform resource devices.
Example 91 includes the system as defined in example 90, wherein the means for managing reconfigurations is to determine a utilization deviation corresponding to the first and second ones of the platform resource devices, compare the utilization deviation with the SLA information, and migrate the workload to one of the first path or the second path to satisfy at least one threshold of the SLA information.
Example 92 includes the system as defined in example 90, wherein the SLA information includes at least one of latency metrics, power consumption metrics, resource cost metrics, or accuracy metrics.
Example 93 includes the system as defined in example 90, wherein generating the first optimization metrics and the second optimization metrics further includes determining a quantity of tenants participating with the identified platform resource devices.
Example 94 includes the system as defined in example 93, wherein the means for managing benchmarks is to embed information corresponding to the quantity of tenants with the first and second semantic information.
Example ID5(A) is the apparatus of any of examples 67-72, further including: in response to detecting a target resource is incapable of satisfying the SLA, selecting an alternate path corresponding to a next satisfied metric.
Example ID5(B) is the apparatus of example ID5(A), wherein the next satisfied metric is at least one of a next lowest latency or a next greatest accuracy.
Example ID5(C) is the computer-readable storage medium of any of examples 73-78, further including selecting an alternate path corresponding to a next satisfied metric in response to detecting a target resource is incapable of satisfying the SLA.
Example ID5(D) is the computer-readable storage medium of example ID5(C), wherein the next satisfied metric is at least one of a next lowest latency or a next greatest accuracy.
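As a non-limiting sketch of Examples ID5(A) through ID5(D), an alternate path may be chosen by ranking the available paths on the metric of interest and stepping to the entry that follows the target that failed. The next_path function and the dictionary layout are hypothetical, and reading "next" as the next entry in the ranked order is an interpretation assumed for this sketch.

```python
from typing import Dict, List, Optional


def next_path(paths: List[Dict], failed: str, metric: str = "latency") -> Optional[str]:
    """Return the path ranked immediately after `failed` for the chosen metric."""
    if metric == "latency":
        ranked = sorted(paths, key=lambda p: p["latency_ms"])              # lowest first
    elif metric == "accuracy":
        ranked = sorted(paths, key=lambda p: p["accuracy"], reverse=True)  # greatest first
    else:
        raise ValueError("metric must be 'latency' or 'accuracy'")
    names = [p["name"] for p in ranked]
    position = names.index(failed)
    return names[position + 1] if position + 1 < len(names) else None


paths = [
    {"name": "gpu", "latency_ms": 5, "accuracy": 0.90},
    {"name": "fpga", "latency_ms": 8, "accuracy": 0.95},
    {"name": "cpu", "latency_ms": 20, "accuracy": 0.93},
]
by_latency = next_path(paths, failed="gpu", metric="latency")
by_accuracy = next_path(paths, failed="fpga", metric="accuracy")
```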
In examples disclosed herein, a computing environment includes one or more processor cores that execute one or more workloads, such as, but not limited to, an artificial intelligence (AI) model. The example computing environment includes multiple tenants structured to run on the processor cores. During execution, an AI model (e.g., a neural network, a decision tree, a Naïve Bayes classifier, etc.) uses resources of a computing device such as cache and memory. However, these resources are limited in that cache (e.g., last level cache (LLC), level three (L3), level four (L4), etc.) is limited to an amount of cache space (e.g., kilobytes, megabytes, etc.) in a processor, and memory (e.g., kilobytes, megabytes, gigabytes, etc.) is limited to an amount of bandwidth (e.g., a rate of data transfer in units of megabytes per second) available to access the memory (e.g., due to read/write access speeds of the memory, due to speed of a memory bus between the processor and the memory, etc.). The amount of cache space and memory bandwidth available to the AI model directly affects the quality of service (QoS) for the AI model. When a computing device or computing node runs multiple AI models for multiple tenants in a multi-tenant computing environment, the AI models share the available cache space and memory bandwidth of the computing device or computing node. When a computing device or computing node runs multiple AI workloads and non-AI workloads, the AI models share the available cache space and memory bandwidth with the non-AI workloads and the other AI models.
However, a lack of coordination between the AI models on how to share such resources makes it difficult or impossible to maintain suitable QoS levels across the multiple AI models. Examples disclosed herein use a model-based approach to dynamically adapt resource availability across multiple AI models in a computing environment to maintain QoS at suitable levels (e.g., according to service level agreements (SLAs)) to improve execution performance across the multiple AI models in a multi-tenant computing system.
Examples disclosed herein generate a resource utilization model which includes generating any number of candidate models (e.g., runtime models) with varying resource utilization to determine how to allocate resources to different workloads (e.g., AI models) using a rewards-based system. Generally speaking, a given workload typically involves numerous models to be invoked to accomplish computation objectives. Depending on available resources, particular combinations of these models (e.g., all of which can contribute to the computational objectives) perform better or worse. In other examples, underlying computational resources (e.g., logically grouped as a number of nodes) are utilized based on workload demands, in which different tenants make particular demands on the available nodes. Proper selection of these one or more candidate models is useful in many ways to improve performance of workload execution. In addition, this can be used to substantially reduce or eliminate noisy neighbor issues in cloud environments. A noisy neighbor is a tenant of a cloud computing system that monopolizes a large amount of resources, sometimes to the detriment of other tenants.
FIG. ID3_1 is an example environment ID3_100 with an example tenant ID3_102, an example orchestrator circuit ID3_104, a first node ID3_114, and a second node ID3_122. The example first node ID3_114 includes a plurality of workloads. As used herein, a workload may be an instance, a number of applications, a number of artificial intelligence models, etc. The example first node ID3_114 includes a first workload ID3_106, a second workload ID3_110, and a third workload ID3_112 (e.g., an nth workload). The example second node ID3_122 includes a fourth workload ID3_116, a fifth workload ID3_118, and a sixth workload ID3_120 (e.g., an nth workload).
The example orchestrator circuit ID3_104 is to place the workloads between the different nodes of the edge network, based on resource utilization data. For example, resource utilization data may be cache size and/or memory bandwidth. Some of the example workloads (e.g., second workload ID3_110) may be tolerant of migration, wherein the example orchestrator circuit ID3_104 may migrate the workload from the first node to a second node. In some examples, the development and/or execution of the artificial intelligence models is altered based on the other workloads operating on the nodes. To illustrate, an example node has a total of ten (“10”) gigabytes of cache and an example first workload requires 7 gigabytes. An example second workload may perform at a first level of accuracy (e.g., 95%) with five (“5”) gigabytes of cache, and the second workload may perform at a second level of accuracy (e.g., 90%) with three (“3”) gigabytes of cache. To accomplish these optimizations, the example orchestrator (as discussed in further detail below) negotiates with the second workload to reduce the cache requirement of the second workload to the 3 gigabytes that remain, so that both the first workload (which requires 7 gigabytes) and the second workload (which requires between 3 and 5 gigabytes) fit on the node, which has a total of 10 gigabytes of available cache.
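The arithmetic of the foregoing illustration may be sketched as follows. The negotiate_cache function is a hypothetical helper, and choosing the most accurate operating point that fits in the remaining cache is an assumed policy rather than a required one.

```python
from typing import List, Optional, Tuple


def negotiate_cache(
    total_gb: float,
    first_workload_gb: float,
    operating_points: List[Tuple[float, float]],  # (cache in GB, accuracy) for the second workload
) -> Optional[Tuple[float, float]]:
    """Pick the most accurate operating point of the second workload that still fits."""
    remaining = total_gb - first_workload_gb
    feasible = [point for point in operating_points if point[0] <= remaining]
    if not feasible:
        return None  # the two workloads cannot share this node
    return max(feasible, key=lambda point: point[1])


# Node with 10 GB of cache, first workload needs 7 GB,
# second workload offers 95% accuracy at 5 GB or 90% accuracy at 3 GB.
agreed = negotiate_cache(10, 7, [(5, 0.95), (3, 0.90)])
```

Here only 3 gigabytes remain after the first workload is placed, so the second workload is offered the 3 gigabyte operating point at 90% accuracy.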
In some examples, the orchestrator circuit ID3_104 negotiates with the example nodes (e.g., first node ID3_114, a second node ID3_122) or the example workloads (e.g., first workload ID3_106, second workload ID3_110, fifth workload ID3_118, etc.) with any type of incentive (e.g., a monetary credit, a time-based credit to allow resource utilization, etc.).
In the example environment ID3_100, an incoming tenant ID3_102 has a new workload to execute. The new workload has specific quality of service (QoS) requirements which the example orchestrator circuit ID3_104 utilizes to determine which node is to execute the new workload.
FIG. ID3_3 is a block diagram of the example orchestrator circuit ID3_104. The example orchestrator circuit ID3_104 includes an example data interface circuit ID3_308, an example request validator circuit ID3_310, an example resource manager circuit ID3_312, an example node availability determiner circuit ID3_316, an example candidate model generator circuit ID3_314, an example workload migrator circuit ID3_318, and an example quality of service (QoS) monitor circuit ID3_320. The example orchestrator circuit ID3_104 is in communication with an example tenant ID3_302, an example first node ID3_304 of an edge network, and an example second node ID3_306 of the edge network.
The example data interface circuit ID3_308 receives a request for a new workload from the tenant ID3_302. The new workload from the tenant ID3_302 includes specific quality of service requirements such as the quality of service may be a function of frequency, cache, memory bandwidth, power, deep learning (DL) precision (e.g., INT8, BF16), DL model characteristics, and/or migration-tolerance.
The example request validator circuit ID3_310 is to determine if the request from the tenant ID3_302 is a legitimate request. The request validator circuit ID3_310 may determine if the request is legitimate based on provisioned policies with revocation list (e.g., by an orchestrator) or data center fleet administrator(s).
The example resource manager circuit ID3_312 is to monitor the nodes of the edge network (e.g., a first node ID3_304, a second node ID3_306). The example resource manager circuit ID3_312 may determine the cache size and memory bandwidth availability of the nodes and determine the workloads currently running on the nodes.
The example node availability determiner circuit ID3_316 negotiates with the example nodes to determine availability for a new workload. The negotiation may include money credit or time credit.
The functionality of the example candidate model generator circuit ID3_314 may be implemented in the example node or may be implemented in the example orchestrator. The example candidate model generator circuit ID3_314 is to generate, for a first artificial intelligence model, a plurality of candidate models with varying resource utilization (e.g., a first candidate model may use a small amount of cache while a second candidate model may use a large amount of cache). The example resource manager may use a resource utilization model which tracks the various inferencing accuracy of the different candidate models, generated by the candidate model generator circuit ID3_314, with different cache size requirements.
The example workload migrator circuit ID3_318 is to determine to migrate (e.g., relocate, move) a workload from an example first node ID3_304 to an example second node ID3_306.
The example QoS monitor circuit ID3_320 is to determine the quality of service of the workloads over time, and in response to a significant drop in quality of service, trigger a migration with the example resource manager circuit ID3_312.
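By way of illustration only, one possible trigger condition for the QoS monitor may be sketched as follows. The disclosure does not define what constitutes a significant drop; the function name `should_migrate`, the use of the best earlier sample as a baseline, and the 20% default threshold are all assumptions made for this sketch.

```python
from typing import Sequence


def should_migrate(qos_history: Sequence[float],
                   drop_threshold: float = 0.2) -> bool:
    """Return True when the latest QoS sample has fallen by more than
    drop_threshold (a fraction) below the best earlier sample.

    The baseline choice and the threshold value are assumptions.
    """
    if len(qos_history) < 2:
        return False  # nothing to compare against yet
    baseline = max(qos_history[:-1])
    if baseline <= 0:
        return False  # avoid dividing by a non-positive baseline
    relative_drop = (baseline - qos_history[-1]) / baseline
    return relative_drop > drop_threshold
```

For instance, a history of 0.95, 0.94, 0.70 represents a relative drop of roughly 26% and would request a migration under the assumed threshold, whereas 0.95, 0.94, 0.93 would not.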
FIG. ID3_4 is a block diagram of an example multi-core computing node ID3_400 (e.g., a computer, a host server, a gateway, etc.) executing multiple AI models ID3_404a-c. The example multi-core computing node ID3_400 includes multiple cores ID3_408a-d which execute the AI models ID3_404a-c. In some examples, the AI models ID3_404a-c execute within a sandbox environment on the multi-core computing node ID3_400. A sandbox environment is a security mechanism that separates running programs in an attempt to prevent system failures and/or software vulnerabilities from spreading. Executing the AI models ID3_404a-c in a sandbox environment allows the multi-core computing node ID3_400 much more control over the resources the AI models ID3_404a-c can access. The resources may include cache space and/or memory bandwidth. Although three AI models ID3_404a-c and four processing cores ID3_408a-d are shown, examples disclosed herein may be implemented using any other number of AI models and/or any other number of cores. In examples disclosed herein, the AI models ID3_404a-c may be any AI models or programs (e.g., neural networks, decision trees, Naïve Bayes classifiers, etc.). The example multi-core computing node ID3_400 determines which ones of the cores ID3_408a-d execute ones of the AI models ID3_404a-c. The example multi-core computing node ID3_400 includes cache ID3_412 and memory ID3_416 that may be allocated to the multiple cores ID3_408a-d to execute ones of the AI models ID3_404a-c. The example cache ID3_412 may be last level cache (LLC) or any other cache for use by the multiple cores ID3_408a-d. The example memory ID3_416 may be dynamic random access memory (DRAM), synchronous DRAM (SDRAM), static random access memory (SRAM), and/or any other type of memory suitable for use as system memory shared by all of the cores ID3_408a-d.
In some examples, space on the cache ID3_412 and bandwidth for the memory ID3_416 allocated to ones of the AI models ID3_404a-c may not be sufficient for that AI model ID3_404a-c to execute at an acceptable performance or may be underutilized when an excessive amount of those resources are allocated to ones of the AI models ID3_404a-c. To improve performance related to cache space and memory bandwidth, examples disclosed herein provide an example controller ID3_420 to monitor resource utilizations of the cache ID3_412 and the memory ID3_416 and determine whether to modify allocations of cache size and/or memory bandwidth across the AI models ID3_404a-c.
The example controller ID3_420 may be an artificial intelligence or machine learning system. In examples disclosed herein, the controller ID3_420 utilizes a neural network architecture to generate a resource utilization model for an AI model ID3_404a-c that tracks the resource utilization of the AI model. For example, the controller may generate candidate models with varying resource utilization corresponding to the AI model. For example, the resource utilization model that tracks the resource utilization of the AI model ID3_404a may include a first candidate AI model ID3_512a of FIG. ID3_5, a second candidate AI model ID3_512b of FIG. ID3_5, and a third candidate AI model ID3_512c of FIG. ID3_5. The example controller ID3_420 may use reinforcement learning techniques to better optimize resource utilization models relative to situations where the reinforcement learning techniques are not used. The example controller ID3_420 may also utilize a differential approach. In a differential approach, the controller ID3_420 may contain a supergraph and/or a supernetwork to create a path in the neural network architecture.
The example controller ID3_420 includes an example monitor circuit ID3_424 and an example analyzer circuit ID3_428. The example monitor circuit ID3_424 collects resource utilization data about the cache ID3_412 and the memory ID3_416. The resource utilization data collected by the example monitor circuit ID3_424 includes, but is not limited to, space utilization of the cache ID3_412, space utilization of the memory ID3_416, and bandwidth utilization of the memory ID3_416. In examples disclosed herein, the bandwidth of the memory ID3_416 indicates how fast data may be accessed in the memory ID3_416. The example monitor circuit ID3_424 provides the collected resource utilization data for access by the analyzer circuit ID3_428. The example monitor circuit ID3_424 periodically or aperiodically collects statistics about the cache ID3_412 and the memory ID3_416 to perform ongoing analyses of space and bandwidth utilizations and allocations. In some examples, the controller ID3_420 modifies the generation of specific candidate models and/or the generation of the overarching resource utilization model based on performance of previously generated candidate models. In these examples, the performance of a candidate model is based on actual data access latency (e.g., pertaining to memory bandwidth) and/or cache inferencing accuracy (e.g., pertaining to cache size) of the candidate model when compared to the expected latency and/or accuracy of the candidate model.
The example analyzer circuit ID3_428 accesses the collected resource utilization data to generate candidate models representative of different space utilizations of the cache ID3_412 across the multiple AI models ID3_404a-c and different bandwidth utilizations of the memory ID3_416 across the multiple AI models ID3_404a-c. As such, the resource generation models track candidate models and define (e.g., set, track, instantiate) cache space utilization parameter values and memory bandwidth utilization parameter values for the AI model ID3_404a-c. The cache space utilization parameter defines how much space in the cache ID3_412 ones of the AI models ID3_404a-c may utilize, which determines the plurality of candidate models that may be generated according to the cache space utilization parameter. The memory bandwidth utilization parameter values define how much bandwidth of the memory ID3_416 ones of the AI models ID3_404a-c may utilize, which determines the plurality of candidate models that may be generated according to the memory bandwidth. For example, if the cache space utilization parameter is a minimum value of two (“2”), a candidate model that uses a cache space value of one (“1”) is not generated. The example analyzer circuit ID3_428 may generate multiple candidate models for the AI models ID3_404a-c to analyze different combinations of cache space utilization values and memory bandwidth utilization values to achieve target QoS levels for the AI models ID3_404a-c. The example analyzer circuit ID3_428 selects one or more of the multiple candidate models to invoke on the cache ID3_412 and the memory ID3_416.
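By way of illustration only, the effect of the utilization parameters on which candidate models are generated can be sketched as a filtered enumeration of combinations. The function name `generate_candidates`, the representation of a candidate as a (cache, bandwidth) pair, and the option values are hypothetical; only the rule that values below the minimum are not generated is taken from the example above.

```python
from itertools import product
from typing import List, Sequence, Tuple

Candidate = Tuple[int, int]  # (cache space value, memory bandwidth value)


def generate_candidates(cache_options: Sequence[int],
                        bandwidth_options: Sequence[int],
                        min_cache: int,
                        min_bandwidth: int) -> List[Candidate]:
    """Enumerate cache/bandwidth combinations, skipping any combination
    that falls below the configured minimum parameter values."""
    return [(cache, bandwidth)
            for cache, bandwidth in product(cache_options, bandwidth_options)
            if cache >= min_cache and bandwidth >= min_bandwidth]


# With a minimum cache value of 2, no candidate with a cache value of 1
# is produced (illustrative option values).
candidates = generate_candidates([1, 2, 4], [1, 2], min_cache=2, min_bandwidth=1)
```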
In some examples, the analyzer circuit ID3_428 selects a candidate model based on a comparison of expected resource utilization and actual resource utilization. For example, after resource allocations of a previously selected candidate model are applied in executing the AI models ID3_404a-c, the example monitor circuit ID3_424 collects subsequent actual resource utilization data corresponding to the running AI models ID3_404a-c. In addition, the example analyzer circuit ID3_428 generates subsequent candidate models and corresponding expected resource utilization data to continue analyzing QoS levels and modifying cache space and memory bandwidth allocations across the AI models ID3_404a-c to maintain suitable QoS levels. During such subsequent analyses, the analyzer circuit ID3_428 compares expected resource utilization data to actual resource utilization data. In this manner, the analyzer circuit ID3_428 can select a subsequent candidate model that will more closely satisfy the cache space and memory bandwidth needs of the AI models ID3_404a-c to replace the currently running AI model.
FIG. ID3_5 is an example performance map ID3_500 of generated candidate models (e.g., runtime models). The candidate models vary in cache size and/or memory bandwidth utilization. Although examples disclosed herein are described based on cache space and memory bandwidth resources, examples disclosed herein may be implemented with other types of resources including hardware-level resources, operating system-level resources, network resources, etc. In addition, types of resources for analyses may be fixed (e.g., unchangeable) or may be user-selectable and/or system-selectable (e.g., selected by artificial intelligence, selected by a program, selected based on a configuration file, etc.).
In other examples, candidate models may additionally or alternatively be generated to analyze performance tradeoffs between different device types for executing workloads. For example, additional performance maps and/or axes data may be generated to show comparative views of different device types selectable to execute workloads. In such examples, the analyzer circuit ID3_428 may generate candidate models and expected resource utilization data for a workload executed using central processing unit (CPU) versus the workload executed using a graphics processing unit (GPU). In this manner, the example analyzer circuit ID3_428 may facilitate selecting a candidate model to allocate resources and/or select different types of resources to execute workloads. Examples of different types of resources for which candidate models may be generated include, but are not limited to, GPUs, CPUs, and/or cross-architecture processing units (XPUs), etc.
Turning to FIG. ID3_5, the y-axis ID3_504 of the example performance map ID3_500 represents inferencing accuracy or memory latency, and the x-axis ID3_508 represents cache size or bandwidth usage of candidate models ID3_512a-c, ID3_516a-c. The candidate models ID3_512a-c, ID3_516a-c are for two AI models ID3_404a-b executing on a multi-core computing node ID3_400. In this example, the candidate models ID3_512a-c (that show varying resource utilization) are generated for an AI model ID3_404a, and the candidate models ID3_516a-c (that show varying resource utilization) are generated for an AI model ID3_404b. The example performance map ID3_500 shows a combined representation of cache inferencing accuracy performance versus cache size and memory latency performance versus memory bandwidth. In some examples, the performance values can be normalized along the y-axis ID3_504 and the cache and bandwidth utilization values can be normalized along the x-axis ID3_508 so that impacts on performances for different combinations of cache size and memory bandwidth can be simultaneously shown in a single graph or performance map. Alternatively, two separate performance maps can be generated. In such examples, one performance map can show cache inferencing accuracy performance versus cache size, and another performance map can show memory latency performance versus memory bandwidth.
In examples disclosed herein, cache inferencing accuracy performance represents how often information in cache is used or accessed by subsequent instructions (e.g., cache hits). For example, a larger cache size accommodates caching more information. As such, a processing core can load more information into cache from memory based on inferences that such information will be subsequently accessed. The larger a cache size, the more likely that inferentially loaded information in cache will result in a cache hit. In examples disclosed herein, memory latency performance represents the amount of time it takes to retrieve information from memory. As memory bandwidth increases, memory latency performance improves (e.g., latency decreases).
The example analyzer circuit ID3_428 generates expected resource utilization data for cache size and memory bandwidth for the candidate models ID3_512a-c, ID3_516a-c. In some examples, the analyzer circuit ID3_428 selects the candidate model ID3_512a based on the expected resource utilization data. In such examples, the selected candidate model ID3_512a may be selected based on its expected resource utilization data satisfying a desired performance for the AI model ID3_404a.
In some examples, the analyzer circuit ID3_428 may select the candidate model ID3_512a for the AI model ID3_404a because the candidate model ID3_512a satisfies a desired inferencing accuracy performance for the AI model ID3_404a. In other examples, the analyzer circuit ID3_428 may select the candidate model ID3_512a for the AI model ID3_404a because the candidate model ID3_512a satisfies a desired memory latency for the AI model ID3_404a.
In some examples, the analyzer circuit ID3_428 may run two AI models, and select the candidate model ID3_512a and the candidate model ID3_516c. The selection of models ID3_512a and ID3_516c rewards the AI model that utilizes either cache size or memory bandwidth more effectively.
This example is written in terms of analyzing cache space, but memory bandwidth may be tracked similarly. In the example of FIG. ID3_5, the analyzer circuit ID3_428 generates three candidate models for the first artificial intelligence model ID3_404a with a low cache variant model ID3_512c (e.g., a first candidate model ID3_512c), a medium cache variant model ID3_512b (e.g., a second candidate model ID3_512b), and a high cache variant model ID3_512a (e.g., a third candidate model ID3_512a). In the example of FIG. ID3_5, the analyzer circuit ID3_428 generates three candidate models for the second artificial intelligence model ID3_404b with a low cache variant model ID3_516c (e.g., a first candidate model ID3_516c), a medium cache variant model ID3_516b (e.g., a second candidate model ID3_516b), and a high cache variant model ID3_516a (e.g., a third candidate model ID3_516a). In the example of FIG. ID3_5, the analyzer circuit ID3_428 determines that there is not a significant (e.g., drastic, according to a threshold) improvement (e.g., increase) in inferencing accuracy or memory latency due to the increase from the lower cache variants to the higher cache variants for the candidate models ID3_516a-c corresponding to the AI model ID3_404b. In the example of FIG. ID3_5, the analyzer circuit ID3_428 determines that there is a significant improvement (e.g., increase) in inferencing accuracy or memory latency due to the increase from the lower cache variants to the higher cache variants for the candidate models ID3_512a-c corresponding to the AI model ID3_404a. The slope (e.g., the accuracy over cache size or memory bandwidth) for the improvement of the artificial intelligence model ID3_404b is lower than the slope for the improvement of the artificial intelligence model ID3_404a.
Based on the limited cache space, the analyzer ID3_428 may determine to not reward the artificial intelligence model ID3_404b and not select the candidate model ID3_516a to run, because the example AI model ID3_404a is able to utilize the additional cache more effectively than the example AI model ID3_404b. In some examples, the resource generation model may be thought of as a function, wherein, for a specific model, an input of cache size on the axis ID3_508 has a corresponding output of inferencing accuracy or memory latency. The resource generation model may describe how a given artificial intelligence model may respond to varying cache sizes.
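By way of illustration only, the slope comparison described above can be sketched by treating each resource generation model as a list of (cache size, inferencing accuracy) points and rewarding the AI model whose accuracy rises fastest with cache size. The function names and all numeric values below are hypothetical and are not taken from FIG. ID3_5; the end-to-end slope between the lowest and highest cache variants is one assumed way of measuring the improvement.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (cache size, inferencing accuracy)


def slope(points: List[Point]) -> float:
    """Accuracy gained per unit of cache between the lowest and the
    highest cache variants of one AI model."""
    low = min(points, key=lambda p: p[0])
    high = max(points, key=lambda p: p[0])
    if high[0] == low[0]:
        return 0.0  # a single cache size gives no slope
    return (high[1] - low[1]) / (high[0] - low[0])


def model_to_reward(curves: Dict[str, List[Point]]) -> str:
    """Return the AI model that uses additional cache most effectively."""
    return max(curves, key=lambda name: slope(curves[name]))


# Illustrative values only: ID3_404a improves sharply with more cache,
# while ID3_404b barely improves.
curves = {
    "ID3_404a": [(1.0, 0.70), (2.0, 0.82), (4.0, 0.94)],
    "ID3_404b": [(1.0, 0.88), (2.0, 0.89), (4.0, 0.90)],
}
rewarded = model_to_reward(curves)
```

Under these assumed values the AI model ID3_404a would receive the additional cache, and the AI model ID3_404b would run one of its lower cache variants.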
In the example of FIG. ID3_5, the example analyzer circuit ID3_428 may select the candidate model with the better resource utilization for memory bandwidth and inferencing accuracy or memory latency. As described above, such selections may be based on comparing a slope of the first group of candidate models (ID3_404a) to a slope of the second group of candidate models (ID3_404b).
In other examples, the analyzer circuit ID3_428 may generate candidate models for more than one of the AI models ID3_404a-c at a time. In these examples, the analyzer circuit ID3_428 may select a candidate model for the AI models ID3_404a-c based on optimization rules. For example, the candidate models ID3_516a-c are generated for the AI model ID3_404b, and the candidate models ID3_512a-c are generated for the AI model ID3_404a. In this example, the candidate model ID3_516a has the highest bandwidth usage ID3_508 and the candidate model ID3_516c has the lowest bandwidth usage ID3_508.
In some examples, the example analyzer circuit ID3_428 selects a candidate model ID3_516a-c to invoke without comparing to another candidate model ID3_512a-c. In such examples, the analyzer circuit ID3_428 may select candidate model ID3_516a based on having the highest bandwidth usage ID3_508. In such examples, the analyzer circuit ID3_428 can select a candidate model based on the number of tenants in a multi-tenant computing environment. In some examples, the tenants may be other artificial intelligence models or may be non-AI workloads. In other examples, the analyzer circuit ID3_428 selects candidate models based on QoS needs and/or the performance map ID3_500. For example, the analyzer circuit ID3_428 may compare multiple sets of candidate models ID3_512a-c, ID3_516a-c and selects a candidate model ID3_512a-c, ID3_516a-c to optimize cache inferencing accuracy and/or memory latency. In such an example, the analyzer circuit ID3_428 may select the candidate model ID3_512a for the AI model ID3_404a and the candidate model ID3_516c for the AI model ID3_404b. In this example, the analyzer circuit ID3_428 selects the selected candidate models ID3_512a, ID3_516c as the latency difference between candidate model ID3_516a and candidate model ID3_516c is not as large as the latency difference between candidate model ID3_512a and candidate model ID3_512c.
If the example analyzer circuit ID3_428 has previously generated and selected a candidate model for the AI model ID3_404a-c, the analyzer circuit ID3_428 may compare recently collected resource utilization data with expected resource utilization data of other candidate models. The analyzer circuit ID3_428 may select a new candidate model based on the comparison.
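By way of illustration only, the comparison of recently collected resource utilization data with the expected resource utilization data of other candidate models may be sketched as a nearest-match search. The disclosure does not specify a comparison metric; the sum of absolute differences used here, the function name `closest_candidate`, and the numeric values are assumptions.

```python
from typing import Dict, Tuple

Utilization = Tuple[float, float]  # (cache space, memory bandwidth)


def closest_candidate(actual: Utilization,
                      expected: Dict[str, Utilization]) -> str:
    """Return the candidate whose expected utilization is nearest to the
    utilization actually observed (assumed metric: L1 distance)."""
    def distance(name: str) -> float:
        cache, bandwidth = expected[name]
        return abs(cache - actual[0]) + abs(bandwidth - actual[1])
    return min(expected, key=distance)


# Illustrative expected utilizations for three candidates of one AI model.
expected = {"low": (1.0, 2.0), "medium": (2.0, 4.0), "high": (4.0, 8.0)}
new_selection = closest_candidate((2.2, 4.5), expected)
```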
FIG. ID3_6 is a block diagram of an example system flow for the multi-core computing node of FIG. ID3_4. At block ID3_604, the example monitor ID3_424 (
At block ID3_608, the example analyzer circuit ID3_428 (
The example analyzer circuit ID3_428 selects a generated candidate model ID3_512a-c (FIG. ID3_5) to invoke. For example, the analyzer circuit ID3_428 selects one of the candidate models ID3_512a-c for use with AI model ID3_404a. The example analyzer circuit ID3_428 selects one(s) of the candidate models ID3_512a-c based on the collected resource utilization data. For example, if the collected resource utilization data does not satisfy a desired cache inferencing accuracy performance, the example analyzer circuit ID3_428 may select the candidate model ID3_512a based on it having a different cache space utilization expected to improve the cache inferencing accuracy performance. For another example, if the candidate model ID3_512a satisfies a desired memory latency performance, the analyzer circuit ID3_428 may select the candidate model ID3_512a based on it having a different memory bandwidth utilization than the collected resource utilization data such that the different memory bandwidth utilization is expected to improve the memory latency performance.
In other examples, the analyzer circuit ID3_428 may generate candidate models for more than one of the AI models ID3_404a-c at a time. In these examples, the analyzer circuit ID3_428 may select a candidate model for the AI models ID3_404a-c based on optimization rules. For example, the candidate models ID3_516a-c are generated for the AI model ID3_404b and the candidate models ID3_512a-c are generated for the AI model ID3_404a. In this example, the candidate model ID3_516a has the highest bandwidth usage ID3_508 and the candidate model ID3_516c has the lowest bandwidth usage ID3_508.
In some examples, the example analyzer circuit ID3_428 selects a candidate model ID3_516a-c to invoke without comparing to another candidate model ID3_512a-c. In such examples, the analyzer circuit ID3_428 may select resource utilization model ID3_516a based on having the highest bandwidth usage ID3_508. In such examples, the analyzer circuit ID3_428 can select a candidate model based on the number of tenants in a multi-tenant computing environment. In other examples, the analyzer circuit ID3_428 selects candidate models based on QoS needs and/or performance maps (e.g., the performance map ID3_500 of
In some examples, the monitor circuit ID3_424 creates a collected resource utilization signature to represent a collection of resource utilization data. The monitor circuit ID3_424 may create a collected resource utilization signature for a group of AI models ID3_404a-c. For example, the monitor circuit ID3_424 may create a collected resource utilization signature for the group of AI models containing AI model ID3_404a and AI model ID3_404b and a different collected resource utilization signature for the group of AI models containing AI model ID3_404b and AI model ID3_404c. The collected resource utilization signature contains information about previous candidate models ID3_512a-c, ID3_516a-c selected for a group of AI models, the expected resource utilization data for the previously selected candidate models, and the newly collected resource utilization data for the group of AI models. The analyzer circuit ID3_428 may access the collected resource utilization signature to compare the newly generated candidate models ID3_512a-c, ID3_516a-c to past resource utilization data to better select a resource utilization model for the AI models in the group that best optimizes the performance of the computing node ID3_400 of FIG. ID3_4 when compared to an example where the collected resource utilization signature is not utilized.
At block ID3_612, the example analyzer circuit ID3_428 provides the selected candidate model to the example monitor circuit ID3_424. At block ID3_616, the example monitor circuit ID3_424 sets the cache size and/or memory bandwidth resources per the selected candidate model. The example monitor circuit ID3_424 may instruct the cache ID3_412 and/or the memory ID3_416 to commit an amount of space and/or bandwidth, respectively, to one or more AI models (e.g., the AI models ID3_404a-c of FIG. ID3_4) based on the selected candidate model. The example monitor circuit ID3_424 may instruct the cache ID3_412 and/or the memory ID3_416 via the multi-cores ID3_408a-d or directly via the cache ID3_412 and/or the memory ID3_416.
The system flow of FIG. ID3_6 may be either one-way or a closed-loop ID3_620. For example, if the system flow is a one-way process, the controller circuit ID3_420 collects resource utilization data for the AI models ID3_404a-c and determines a candidate model to invoke for each AI model ID3_404a-c a single time. In a one-way process, the controller circuit ID3_420 collects resource utilization data and generates candidate models based on expected resource utilization data that is likely to achieve a desired performance. In one-way process examples, the controller circuit ID3_420 does not collect actual resource utilization data of previously applied candidate models for comparison in selecting a subsequent resource utilization model. In other examples, if the system flow of FIG. ID3_6 is a closed-loop process, the controller circuit ID3_420 collects actual resource utilization data and generates subsequent candidate models for the AI models ID3_404a-c in a recurring fashion. In these examples, the controller circuit ID3_420 may analyze actual resource utilization data and performance data associated with previously applied candidate models to analyze and select subsequent candidate models to move closer to a desired performance. Unlike the one-way process, the closed-loop process may continue to generate, analyze, and select subsequent candidate models based on collected resource utilization data and expected resource utilization data.
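By way of illustration only, the difference between the one-way process and the closed-loop process can be sketched as a loop that runs either once or repeatedly. The function `run_flow`, its callable parameters, and the fixed iteration bound are hypothetical stand-ins for the collection, selection, and allocation operations described above; a real closed loop may run indefinitely rather than for a fixed number of iterations.

```python
from typing import Callable, List


def run_flow(collect: Callable[[], float],
             select: Callable[[float], str],
             apply_candidate: Callable[[str], None],
             closed_loop: bool,
             max_iterations: int = 3) -> List[str]:
    """Collect utilization data, select a candidate model, and apply it.

    A one-way flow performs the sequence a single time; a closed-loop
    flow repeats it so that later selections reflect data collected
    after earlier candidates were applied.
    """
    applied: List[str] = []
    iterations = max_iterations if closed_loop else 1
    for _ in range(iterations):
        utilization = collect()
        candidate = select(utilization)
        apply_candidate(candidate)
        applied.append(candidate)
    return applied


# Stub collaborators with illustrative utilization readings.
readings = iter([0.9, 0.6, 0.4])
allocations: List[str] = []
history = run_flow(lambda: next(readings),
                   lambda u: "high" if u > 0.5 else "low",
                   allocations.append,
                   closed_loop=True)
```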
In some examples, the orchestrator circuit ID3_104 includes means for orchestrating a circuit. For example, the means for orchestrating a circuit may be implemented by orchestrator circuitry ID3_104. In some examples, the orchestrator circuitry ID3_104 may be implemented by machine executable instructions such as that implemented by at least blocks corresponding to FIGS. ID3_7, ID3_8 and/or ID3_9 executed by processor circuitry, which may be implemented by the example processor circuitry D212 of
In some examples, the means for orchestrating ID3_104 includes means for data interfacing, means for node availability determining, means for request validating, means for model generating, means for resource managing, means for workload migrating and means for QoS monitoring, which may be implemented, respectively, by the example data interface circuit ID3_308, the example node availability determiner circuit ID3_316, the example request validator circuit ID3_310, the example candidate model generator circuit ID3_314, the example resource manager circuit ID3_312, the example workload migrator circuit ID3_318 and the example QoS monitor circuit ID3_320.
While an example manner of implementing the example orchestrator circuit ID3_104 of FIGS. ID3_1 and ID3_3, and implementing the example controller circuit ID3_420 of FIG. ID3_4 is illustrated in FIGS. ID3_1, ID3_3 and ID3_4, one or more of the elements, processes, and/or devices illustrated in FIGS. ID3_1, ID3_3 and/or ID3_4 may be combined, divided, re-arranged, omitted, eliminated, and/or implemented in any other way. Further, the example data interface circuit ID3_308, the example node availability determiner circuit ID3_316, the example request validator circuit ID3_310, the example candidate model generator circuit ID3_314, the example resource manager circuit ID3_312, the example workload migrator circuit ID3_318, the example QoS monitor circuit ID3_320, the example controller circuit ID3_420 and/or, more generally, the orchestrator circuit ID3_104 of FIGS. ID3_1, ID3_3 and/or ID3_4, may be implemented by hardware, software, firmware, and/or any combination of hardware, software, and/or firmware. Thus, for example, any of the example data interface circuit ID3_308, the example node availability determiner circuit ID3_316, the example request validator circuit ID3_310, the example candidate model generator circuit ID3_314, the example resource manager circuit ID3_312, the example workload migrator circuit ID3_318, the example QoS monitor circuit ID3_320, the example controller circuit ID3_420 and/or, more generally, the orchestrator circuit ID3_104 of FIGS. ID3_1, ID3_3 and/or ID3_4, could be implemented by processor circuitry, analog circuit(s), digital circuit(s), logic circuit(s), programmable processor(s), programmable microcontroller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), and/or field programmable logic device(s) (FPLD(s)) such as Field Programmable Gate Arrays (FPGAs).
When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example data interface circuit ID3_308, the example node availability determiner circuit ID3_316, the example request validator circuit ID3_310, the example candidate model generator circuit ID3_314, the example resource manager circuit ID3_312, the example workload migrator circuit ID3_318, the example QoS monitor circuit ID3_320, the example controller circuit ID3_420 and/or, more generally, the orchestrator circuit ID3_104 of FIGS. ID3_1, ID3_3 and/or ID3_4 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc., including the software and/or firmware. Further still, the example orchestrator circuit ID3_104 of FIGS. ID3_1, ID3_3 and ID3_4 may include one or more elements, processes, and/or devices in addition to, or instead of, those illustrated in FIGS. ID3_1, ID3_3 and ID3_4, and/or may include more than one of any or all of the illustrated elements, processes and devices.
Flowcharts representative of example hardware logic circuitry, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the orchestrator circuit ID3_104 and/or the controller circuit ID3_420 of FIGS. ID3_1, ID3_3 and ID3_4 are shown in FIGS. ID3_7, ID3_8 and ID3_9. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by processor circuitry, such as the processor circuitry D212 shown in the example processor platform D200 discussed below in connection with
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., as portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc., in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and/or stored on separate computing devices, wherein the parts when decrypted, decompressed, and/or combined form a set of machine executable instructions that implement one or more operations that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc., in order to execute the machine readable instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example operations of FIGS. ID3_7, ID3_8 and/or ID3_9 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on one or more non-transitory computer and/or machine readable media such as optical storage devices, magnetic storage devices, an HDD, a flash memory, a read-only memory (ROM), a CD, a DVD, a cache, a RAM of any type, a register, and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the terms non-transitory computer readable medium and non-transitory computer readable storage medium are expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc., may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the terms “comprising” and “including” are open ended. The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, or (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, or (3) at least one A and at least one B.
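As an illustration only, the seven combinations covered by the form “A, B, and/or C” are the non-empty subsets of A, B and C. The following Python sketch enumerates them in the order listed above. The function name and_or_combinations is hypothetical and is not part of this disclosure.

```python
from itertools import combinations

def and_or_combinations(items):
    """Return every non-empty subset of items, smallest subsets first."""
    result = []
    for size in range(1, len(items) + 1):
        result.extend(combinations(items, size))
    return result

combos = and_or_combinations(["A", "B", "C"])
for combo in combos:
    # e.g. ("A", "B") is printed as "A with B"
    print(" with ".join(combo))
```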
As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” object, as used herein, refers to one or more of that object. The terms “a” (or “an”), “one or more”, and “at least one” are used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., the same entity or object. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.
FIG. ID3_7 is a flowchart representative of example machine readable instructions and/or example operations ID3_700 that may be executed and/or instantiated by processor circuitry to determine if an artificial intelligence model is to be moved from a first node to a second node. The machine readable instructions and/or operations ID3_700 of FIG. ID3_7 begin at block ID3_702, at which the example data interface circuit ID3_308 of the example orchestrator circuit ID3_104 receives a workload (e.g., instance, app, artificial intelligence model) with a user-defined quality of service from a tenant ID3_302. For example, the quality of service may be a function of frequency, cache, memory bandwidth, power, DL precision (e.g., INT8, BF16), DL model characteristics, and migration-tolerance.
At block ID3_706, the example request validator circuit ID3_310 determines if the request is valid. If the example request validator circuit ID3_310 determines that the request (e.g., the workload from the tenant) is valid (e.g., “YES”), control advances to block ID3_708. Alternatively, if the example request validator circuit ID3_310 determines that the request is not valid (e.g., “NO”), control returns to block ID3_704.
At block ID3_708, the example resource manager circuit ID3_312 gathers a list of the models available and systems with the requested quality of service resources. For example, the example resource manager circuit ID3_312 monitors the node devices (e.g., first node ID3_304, a second node ID3_306) and determines the workloads (e.g., applications, artificial intelligence models) currently running on the node devices.
At block ID3_710, the example resource manager circuit ID3_312 determines if model variations are needed based on the user-requested quality of service. For example, the resource manager circuit ID3_312 may compare the quality of service characteristics of the artificial intelligence models available with the user-requested quality of service. If the example resource manager circuit ID3_312 determines that model variations are not needed (e.g., “NO”), control advances to block ID3_714. If the example resource manager circuit ID3_312 determines that model variations are needed (e.g., “YES”), control advances to block ID3_712.
At block ID3_712, the example candidate model generator circuit ID3_314 performs a permutation of the model list. For example, the example candidate model generator circuit ID3_314 may generate a plurality of candidate models for a first artificial intelligence model. If the example minimum cache size is ten (“10”) megabytes, the example candidate model generator circuit ID3_314 may generate a first candidate model with a cache size of ten (“10”) megabytes and a second candidate model with a cache size of fifteen (“15”) megabytes to determine if an increase in cache size significantly increases performance. Control advances to block ID3_714.
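The generation of candidate models at block ID3_712 can be sketched as follows. This is a minimal illustration only: the CandidateModel record, the generate_candidates function, and the five megabyte step are assumptions chosen to reproduce the ten and fifteen megabyte example above, and are not defined by this disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateModel:
    """Hypothetical record describing one variation of an AI model."""
    model_name: str
    cache_mb: int

def generate_candidates(model_name, min_cache_mb, step_mb=5, count=2):
    """Generate `count` variations, starting at the minimum cache size.

    The step size and count are illustrative assumptions.
    """
    return [
        CandidateModel(model_name, min_cache_mb + i * step_mb)
        for i in range(count)
    ]

candidates = generate_candidates("first_ai_model", min_cache_mb=10)
for candidate in candidates:
    print(candidate)
```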
At block ID3_714, the example node availability determiner circuit ID3_316 checks availability of the node devices to find a free and/or otherwise partially available machine that is willing and/or able to negotiate. For example, the example node availability determiner circuit ID3_316 may monitor the first node ID3_304 and the second node ID3_306 and determine that the first node ID3_304 has the required cache space to accept the new workload. If the example node availability determiner circuit ID3_316 is unable to find an available node (e.g., “NO”), control advances to block ID3_716. Alternatively, if the example node availability determiner circuit ID3_316 finds an available node (e.g., “YES”), control advances to block ID3_718.
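As a non-limiting sketch of the availability check at block ID3_714, the following Python code scans a set of nodes for one with enough free cache and chooses the next block accordingly. The find_available_node function, the node names, and the free-cache values are hypothetical and are used for illustration only.

```python
def find_available_node(nodes, required_cache_mb):
    """Return the name of the first node with enough free cache, or None."""
    for name, free_cache_mb in nodes.items():
        if free_cache_mb >= required_cache_mb:
            return name
    return None

# Free cache per node in megabytes (made-up values for illustration).
nodes = {"first_node": 12, "second_node": 4}

chosen = find_available_node(nodes, required_cache_mb=10)
# "YES" leads to negotiation (block ID3_718); "NO" leads to the
# policy-based action (block ID3_716).
next_block = "ID3_718" if chosen is not None else "ID3_716"
print(chosen, next_block)
```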
At block ID3_716, the example orchestrator circuit ID3_104 executes a policy-based action (e.g., a baseboard management controller (BMC) to monitor action(s), ME). After block ID3_716, the example instructions ID3_700 end.
At block ID3_718, the example node availability determiner circuit ID3_316 negotiates with the available node. For example, the example node availability determiner circuit ID3_316 may determine a mapping of CLoS and SRMID. In some examples, money or time may be negotiated. In still other examples, price for the virtual instance can be lowered and/or micropayments may be provided for future rentals. In some examples, the node availability determiner circuit ID3_316 facilitates bidding for resources, and such bidding may be guided by one or more active policies (e.g., aggressive bidding for best latency improvements). In some examples, learned settings may be fed forward via, for example, transfer learning.
At block ID3_720, the example orchestrator circuit ID3_104 (e.g., RDT) negotiates with existing workloads (e.g., applications, instances, artificial intelligence models). If the example orchestrator circuit ID3_104 successfully negotiates with the example existing workloads being executed on the nodes (e.g., first node ID3_304, a second node ID3_306) (e.g., “YES”), control advances to block ID3_722. Alternatively, if the example orchestrator circuit ID3_104 does not successfully negotiate with the example existing workloads being executed on the nodes (e.g., “NO”), control returns to block ID3_714.
At block ID3_722, the example workload migrator circuit ID3_318 determines whether to migrate a workload (e.g., instance, application, artificial intelligence model) from a first node to a second node. If the example workload migrator circuit ID3_318 determines to migrate a workload (e.g., “YES”), control returns to block ID3_714. For example, the node that negotiated with the node availability determiner circuit ID3_316 may have the cache size for the first (e.g., new) workload. Alternatively, if the example workload migrator circuit ID3_318 determines to not migrate a workload (e.g., “NO”), control advances to block ID3_724. In some examples, the node may not have the cache size, such that a second (e.g., different) workload may be migrated to a second node.
At block ID3_724, the example orchestrator circuit ID3_104 may update the CLoS of existing workloads (e.g., instances, applications, artificial intelligence models).
At block ID3_726, the example orchestrator circuit ID3_104 instantiates the workload (e.g., spins up the requested instance) requested by the tenant ID3_302. The instructions ID3_700 end.
FIG. ID3_8 is a flowchart representative of example machine readable instructions and/or example operations ID3_800 that may be executed and/or instantiated by processor circuitry to determine the candidate model to select. The instructions ID3_800 begin at block ID3_804, at which the example monitor circuit ID3_424 (FIG. ID3_4) collects resource utilization data. For example, the collected resource utilization data may include, but is not limited to, the space utilization of the cache ID3_412, the space utilization of the memory ID3_416, and/or the bandwidth utilization of the memory ID3_416. In examples disclosed herein, the bandwidth of the memory ID3_416 indicates how fast data may be accessed in the memory ID3_416.
At block ID3_808, the example analyzer circuit ID3_428 (FIG. ID3_4) generates resource utilization models based on the collected resource utilization data. In this example, the generated resource utilization models may define space utilization parameters for an AI model ID3_404a-c of the cache ID3_412. For example, the space utilization parameter may define how much space in the cache ID3_412 the AI model ID3_404a-c may utilize. Also in this example, the generated candidate models may define resource utilization parameters for an AI model ID3_404a-c of the memory ID3_416. The analyzer circuit ID3_428 may define bandwidth utilization parameters for the bandwidth of the memory ID3_416. The bandwidth utilization parameter may define how much bandwidth of the memory ID3_416 the AI model ID3_404a-c may utilize. The analyzer circuit ID3_428 may generate multiple candidate models ID3_512a-c for the AI model ID3_404a-c.
At block ID3_812, the example analyzer circuit ID3_428 selects at least one of the generated candidate models ID3_512a-c. For example, the analyzer circuit ID3_428 selects one of the candidate models ID3_512a-c for use with the AI model ID3_404a. The example analyzer circuit ID3_428 selects one(s) of the candidate models ID3_512a-c based on the collected resource utilization data. For example, if the candidate model ID3_512a shows better inferencing accuracy performance with expected space utilization different than the collected space utilization data, the analyzer circuit ID3_428 may select candidate model ID3_512a. In another example, if the candidate model ID3_512a shows better memory latency performance with expected bandwidth utilization different than the collected bandwidth utilization data, the analyzer circuit ID3_428 may select candidate model ID3_512a.
In other examples, the analyzer circuit ID3_428 may generate candidate models for more than one of the AI models ID3_404a-c at a time. In these examples, the analyzer circuit ID3_428 may select a candidate model for the AI models ID3_404a-c based on optimization rules. For example, the candidate models ID3_516a-c are generated for the AI model ID3_404b and the candidate models ID3_512a-c are generated for the AI model ID3_404a. In this example, the candidate model ID3_516a has the highest bandwidth usage ID3_508 and the candidate model ID3_516c has the lowest bandwidth usage ID3_508.
In some examples, the example analyzer circuit ID3_428 selects a candidate model ID3_516a-c without comparing to another candidate model ID3_512a-c. In such examples, the analyzer circuit ID3_428 may select candidate model ID3_516a based on having the highest bandwidth usage ID3_508. In such examples, the analyzer circuit ID3_428 can select a candidate model based on the number of tenants in a multi-tenant computing environment. In other examples, the analyzer circuit ID3_428 selects candidate models based on QoS needs and/or a performance map. For example, the analyzer circuit ID3_428 may compare multiple sets of candidate models ID3_512a-c, ID3_516a-c and select a candidate model ID3_512a-c, ID3_516a-c to optimize latency. In such an example, the analyzer circuit ID3_428 may select the candidate model ID3_512a for the AI model ID3_404a and the candidate model ID3_516c for the AI model ID3_404b. In this example, the analyzer circuit ID3_428 selects the candidate models ID3_512a, ID3_516c because the latency difference between candidate model ID3_516a and candidate model ID3_516c is not as large as the latency difference between candidate model ID3_512a and candidate model ID3_512c.
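The latency-based selection described above can be sketched as follows. This is an illustrative simplification rather than a definitive implementation: it assumes that only one AI model can be given its highest-usage candidate (“a”) and that every other AI model receives its lowest-usage candidate (“c”). The pick_candidates function and the latency values are hypothetical.

```python
def pick_candidates(latency_ms):
    """Choose one candidate per AI model from expected latencies.

    latency_ms maps an AI model to {"a": latency, "c": latency}, where
    "a" is the highest-usage candidate and "c" the lowest-usage one.
    The AI model that would lose the most latency by dropping from "a"
    to "c" keeps "a"; every other AI model is assigned "c".
    """
    gap = {model: c["c"] - c["a"] for model, c in latency_ms.items()}
    winner = max(gap, key=gap.get)
    return {model: ("a" if model == winner else "c") for model in latency_ms}

# Made-up expected latencies in milliseconds.
latency_ms = {
    "ID3_404a": {"a": 10.0, "c": 30.0},  # candidates ID3_512a, ID3_512c
    "ID3_404b": {"a": 12.0, "c": 15.0},  # candidates ID3_516a, ID3_516c
}
selection = pick_candidates(latency_ms)
print(selection)
```

With these values the gap for the AI model ID3_404a (20 ms) exceeds the gap for the AI model ID3_404b (3 ms), so the sketch keeps the “a” candidate for ID3_404a and assigns the “c” candidate to ID3_404b, mirroring the selection of ID3_512a and ID3_516c.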
The analyzer circuit ID3_428 generates expected resource utilization data for the selected resource utilization model. If the analyzer circuit ID3_428 generates additional resource utilization models for the AI model ID3_404a, the generated resource utilization models may be based on the expected resource utilization data for the selected model.
At block ID3_816, the example monitor circuit ID3_424 allocates resources based on the selected resource utilization model. The example monitor circuit ID3_424 may instruct the cache ID3_412 and/or the memory ID3_416 to commit an amount of space and/or bandwidth, respectively, to an AI model (e.g., an AI model ID3_404a-c of
At block ID3_818, the example computing node ID3_400 executes AI models ID3_404a-c. For example, the AI models ID3_404a-c execute using the resources allocated based on the at least one selected resource utilization model.
At block ID3_820, the example monitor circuit ID3_424 collects actual resource utilization data from the cache ID3_412 and the memory ID3_416 and/or actual performance of the AI models ID3_404a-c. The collected actual resource utilization data may include, but is not limited to, the space utilization of the cache ID3_412, the space utilization of the memory ID3_416, and/or the bandwidth utilization of the memory ID3_416. The actual performance may include cache inferencing accuracy and/or memory latency.
At block ID3_824, the example analyzer circuit ID3_428 compares the collected actual resource utilization data and/or actual performance to the expected resource utilization data and/or expected performance of one or more subsequent resource utilization models. The example performance map ID3_500 of FIG. ID3_5 indicates the expected resource utilization data and expected performance of the example candidate models ID3_512a-c, ID3_516a-c. For example, the analyzer circuit ID3_428 may select the candidate model ID3_512a. In this example, the analyzer circuit ID3_428 saves the expected resource utilization data of the candidate model ID3_512a. The collected actual resource utilization data and/or performance may be compared to the expected resource utilization data and/or performance. A subsequent candidate model may be selected based on the comparison showing improved performance.
At block ID3_828, the example analyzer circuit ID3_428 determines whether to continue modifying resource utilization of the cache ID3_412 and/or the memory ID3_416. In a one-way process, the analyzer circuit ID3_428 determines to not continue modifying resource utilization of the cache ID3_412 and/or the memory ID3_416. However, in a closed-loop process, the analyzer circuit ID3_428 may determine to either continue modifying resource utilization or to not continue modifying resource utilization (e.g., based on measured performances of the AI models ID3_404a-c). If the example analyzer circuit ID3_428 determines to continue generating resource utilization models, the process returns to block ID3_804. If the example analyzer circuit ID3_428 determines to not continue generating resource utilization models, the instructions ID3_800 of FIG. ID3_8 end.
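The closed-loop process of blocks ID3_816 through ID3_828 can be sketched as a loop that tries successive allocations while measured performance keeps improving. The stopping rule shown here (stop at the first allocation that does not reduce latency) is an assumption made for illustration, since the description above only states that the decision may be based on measured performance. The closed_loop function and the latency table are hypothetical.

```python
def closed_loop(allocations, measure):
    """Try cache allocations in order while measured latency improves.

    allocations: cache sizes in megabytes to try, in order.
    measure: callable returning the measured latency (ms) for a size.
    Returns (best_allocation, best_latency).
    """
    best_alloc, best_latency = None, float("inf")
    for alloc in allocations:
        latency = measure(alloc)      # allocate, execute, collect
        if latency < best_latency:    # compare against the best so far
            best_alloc, best_latency = alloc, latency
        else:                         # no improvement: stop modifying
            break
    return best_alloc, best_latency

# Made-up measurements standing in for the monitor circuit's data.
measured_latency_ms = {10: 20.0, 15: 14.0, 20: 14.5, 25: 13.0}
best = closed_loop([10, 15, 20, 25], measured_latency_ms.get)
print(best)
```

In this sketch the loop stops at twenty megabytes because latency did not improve, so the twenty-five megabyte allocation is never tried. A one-way process would correspond to a single pass with no loop.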
FIG. ID3_9 is an overview of example operations ID3_900 that may be executed by the processor to execute the instructions to orchestrate the migration of workloads in an edge network.
As discussed above,
From the foregoing, it will be appreciated that example systems, methods, apparatus, and articles of manufacture have been disclosed that improve efficiency of computing devices that share resources. In particular, examples disclosed herein remove operator discretion regarding which models to apply to resource allocation to workloads, and enable the negotiation of workloads competing for same/similar resources. The disclosed systems, methods, apparatus, and articles of manufacture are accordingly directed to one or more improvement(s) in the operation of a machine such as a computer or other electronic and/or mechanical device.
Example methods, apparatus, systems, and articles of manufacture to optimize resources in Edge networks are disclosed herein. Further examples and combinations thereof include the following:
Example 1 includes an apparatus comprising at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus, a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations, or Application Specific Integrated Circuitry including logic gate circuitry to perform one or more third operations, the processor circuitry to at least one of perform at least one of the first operations, the second operations or the third operations to generate first candidate models, generate second candidate models, the first candidate models to be allocated first resources and the second candidate models to be allocated second resources, collect first resource utilization data corresponding to a workload executing the first resources, collect second resource utilization data corresponding to the workload executing the second resources, calculate a first slope corresponding to workload performance and the first resources, calculate a second slope corresponding to workload performance and the second resources, and select one of the first candidate models or the second candidate models based on a comparison of the first and second slope.
Example 2 includes the apparatus as defined in example 1, wherein the processor circuitry is to select the first candidate models or the second candidate models based on a threshold slope value.
Example 3 includes the apparatus as defined in example 1, wherein the processor circuitry is to allocate a first quantity of cache to the workload.
Example 4 includes the apparatus as defined in example 3, wherein the first quantity of cache corresponds to a first node of a computing platform.
Example 5 includes the apparatus as defined in example 1, wherein the processor circuitry is to acquire the first resource utilization data as at least one of workload accuracy or workload latency.
Example 6 includes the apparatus as defined in example 1, wherein the workload includes a plurality of artificial intelligence models from dissimilar tenants of a multi-tenant computing environment.
Example 7 includes the apparatus as defined in example 1, wherein the processor circuitry is to determine whether the first candidate models and the second candidate models tolerate migration.
Example 8 includes the apparatus as defined in example 7, wherein the processor circuitry is to skip integration of models that do not tolerate migration.
Example 9 includes at least one non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least monitor, in a first phase, a hardware platform to identify features to train an artificial intelligence model, extract information regarding the hardware platform associated with a marker related to the features occurring during the first phase, and store the information in a database to, in a second phase, enable hardware-aware training of the artificial intelligence model.
Example 10 includes the at least one computer readable storage medium as defined in example 9, wherein the instructions, when executed, cause the at least one processor to select the first candidate models or the second candidate models based on a threshold slope value.
Example 11 includes the at least one computer readable storage medium as defined in example 9, wherein the instructions, when executed, cause the at least one processor to allocate a first quantity of cache to the workload.
Example 12 includes the at least one computer readable storage medium as defined in example 11, wherein the first quantity of cache corresponds to a first node of a computing platform.
Example 13 includes the at least one computer readable storage medium as defined in example 9, wherein the instructions, when executed, cause the at least one processor to acquire the first resource utilization data as at least one of workload accuracy or workload latency.
Example 14 includes the at least one computer readable storage medium as defined in example 9, wherein the workload includes a plurality of artificial intelligence models from dissimilar tenants of a multi-tenant computing environment.
Example 15 includes the at least one computer readable storage medium as defined in example 9, wherein the instructions, when executed, cause the at least one processor to determine whether the first candidate models and the second candidate models tolerate migration.
Example 16 includes the at least one computer readable storage medium as defined in example 15, wherein the instructions, when executed, cause the at least one processor to skip integration of models that do not tolerate migration.
Example 17 includes a method comprising generating first candidate models, generating second candidate models, the first candidate models to be allocated first resources and the second candidate models to be allocated second resources, collecting first resource utilization data corresponding to a workload executing the first resources, collecting second resource utilization data corresponding to the workload executing the second resources, calculating a first slope corresponding to workload performance and the first resources, calculating a second slope corresponding to workload performance and the second resources, and selecting one of the first candidate models or the second candidate models based on a comparison of the first and second slope.
Example 18 includes the method as defined in example 17, further including selecting the first candidate models or the second candidate models based on a threshold slope value.
Example 19 includes the method as defined in example 17, further including allocating a first quantity of cache to the workload.
Example 20 includes the method as defined in example 19, wherein the first quantity of cache corresponds to a first node of a computing platform.
Example 21 includes the method as defined in example 17, further including acquiring the first resource utilization data as at least one of workload accuracy or workload latency.
Example 22 includes the method as defined in example 17, wherein the workload includes a plurality of artificial intelligence models from dissimilar tenants of a multi-tenant computing environment.
Example 23 includes the method as defined in example 17, further including determining whether the first candidate models and the second candidate models tolerate migration.
Example 24 includes the method as defined in example 23, further including skipping integration of the models that do not tolerate migration.
Example 25 includes a system comprising means for analyzing to generate first candidate models, and generate second candidate models, the first candidate models allocated a first resource allocation and the second candidate models allocated a second resource allocation, and means for monitoring to collect first resource utilization data corresponding to a workload executing the first resource allocation, collect second resource utilization data corresponding to the workload executing the second resource allocation, the means for analyzing to calculate a first slope corresponding to workload performance and first resource allocation, calculate a second slope corresponding to workload performance and second resource allocation, and select one of the first candidate models or the second candidate models based on a comparison of the first and second slope.
Example 26 includes the system as defined in example 25, wherein the means for analyzing is to select the first candidate models or the second candidate models based on a threshold slope value.
Example 27 includes the system as defined in example 25, wherein the first candidate models allocate a first quantity of cache to the workload.
Example 28 includes the system as defined in example 27, wherein the first quantity of cache corresponds to a first node of a computing platform.
Example 29 includes the system as defined in example 25, wherein the first resource utilization data includes at least one of workload accuracy or workload latency.
Example 30 includes the system as defined in example 25, further including means for orchestrating to determine whether the first candidate models and the second candidate models tolerate migration.
Example 31 includes an apparatus to adapt workload model selection comprising an analyzer circuit to generate first candidate models, and generate second candidate models, the first candidate models allocated a first resource allocation and the second candidate models allocated a second resource allocation, and a monitor circuit to collect first resource utilization data corresponding to a workload executing the first resource allocation, collect second resource utilization data corresponding to the workload executing the second resource allocation, the analyzer circuit to calculate a first slope corresponding to workload performance and first resource allocation, calculate a second slope corresponding to workload performance and second resource allocation, and select one of the first candidate models or the second candidate models based on a comparison of the first and second slope.
Example 32 includes the apparatus as defined in example 31, wherein the analyzer circuit is to select the first candidate models or the second candidate models based on a threshold slope value.
Example 33 includes the apparatus as defined in example 31, wherein the first candidate models allocate a first quantity of cache to the workload.
Example 34 includes the apparatus as defined in example 33, wherein the first quantity of cache corresponds to a first node of a computing platform.
Example 35 includes the apparatus as defined in example 31, wherein the first resource utilization data includes at least one of workload accuracy or workload latency.
Example 36 includes the apparatus of example 31, wherein the workloads are a plurality of AI models of different tenants of a multi-tenant computing environment.
Example 37 includes the apparatus as defined in example 31, further including an orchestrator circuit to determine whether the first candidate models and the second candidate models tolerate migration.
Example ID3(A) is the apparatus of any of examples 1-8, further including a plurality of tenants executing at least one of neural networks, decision trees or Naïve Bayes classifiers with at least one of last level cache, level three cache or level four cache.
Example ID3(B) is the apparatus of any of examples 1-8, further including selecting combinations of the candidate models based on relative performance metrics at a first time with first tenants, and selecting alternate ones of the combinations of the candidate models based on the relative performance metrics at a second time with second tenants.
Example ID3(C) is the computer readable storage medium of any of examples 9-16, wherein the instructions, when executed, cause at least one processor to at least select combinations of the candidate models based on relative performance metrics at a first time with first tenants, and select alternate ones of the combinations of the candidate models based on the relative performance metrics at a second time with second tenants.
Example ID3(D) is the method of any of examples 17-24, further including selecting combinations of the candidate models based on relative performance metrics at a first time with first tenants, and selecting alternate ones of the combinations of the candidate models based on the relative performance metrics at a second time with second tenants.
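The slope comparison recited in examples 1, 17, 25 and 31, together with the threshold slope value of examples 2, 18, 26 and 32, can be sketched as follows. The examples above do not define how a slope is computed or which slope is preferred, so this sketch assumes a two-point slope (change in performance divided by change in allocated resource), prefers the steeper slope, and rejects both sets when neither slope reaches the threshold. The slope and select_models functions and all numeric values are hypothetical.

```python
def slope(start, end):
    """Two-point slope: change in performance per unit of resource."""
    (res0, perf0), (res1, perf1) = start, end
    return (perf1 - perf0) / (res1 - res0)

def select_models(first_points, second_points, threshold=0.0):
    """Return "first" or "second" for the steeper slope, or None when
    the steeper slope is below the threshold slope value."""
    first_slope = slope(*first_points)
    second_slope = slope(*second_points)
    if first_slope >= second_slope:
        name, best_slope = "first", first_slope
    else:
        name, best_slope = "second", second_slope
    return name if best_slope >= threshold else None

# (cache in MB, inferences per second); made-up measurements.
first_points = ((10, 100.0), (15, 150.0))   # slope 10.0
second_points = ((10, 100.0), (15, 110.0))  # slope 2.0

choice = select_models(first_points, second_points)
print(choice)
```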
Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent.
Machine learning (ML) and/or other artificial intelligence (AI) model formation (e.g., training, testing, and deployment) involves leveraging a data set of information. Often, the quality and amount of data in the data set determines the quality of the resulting model. Selection of which data in a data set is important for training, testing, and/or validation of a model is referred to as featurization. A feature store or database is a library of featurizations that can be applied to data of a certain type. Featurization data stored in the database can then be used to train, test, etc., a machine learning network construct.
Manual selection of features is difficult, if not impossible. Correlations, verifications, and/or other analysis cannot be done manually by a human. As such, model quality suffers, which results in erroneous model outcomes and introduces failures or faults in systems and processes relying on the model output to function. Automated featurization can remedy these problems and improve the operation and accuracy of ML/AI models and associated systems, processes, etc.
Automated machine learning (referred to as automated ML or AutoML) automates formation of ML and/or other AI models. In AutoML, a plurality of pipelines are created in parallel to test a plurality of algorithms and parameters. The AutoML service iterates through ML algorithms paired with feature selections, where each iteration produces a model with a training score.
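For purposes of illustration only, the iteration over algorithms paired with feature selections can be sketched as follows. This is a minimal sketch and not the disclosed implementation: the toy algorithms `fit_mean` and `fit_linear`, the data in `ROWS` and `TARGETS`, and the use of negative mean squared error as the training score are all hypothetical assumptions chosen to keep the example self-contained.

```python
from itertools import combinations

# Hypothetical toy data: each row is (feature_a, feature_b).
ROWS = [(1.0, 5.0), (2.0, 3.0), (3.0, 8.0), (4.0, 1.0)]
TARGETS = [2.0, 4.0, 6.0, 8.0]


def fit_mean(xs, ys):
    """Baseline algorithm: always predict the mean target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Least-squares line fitted to the sum of the selected features."""
    s = [sum(x) for x in xs]
    n = len(s)
    mean_s, mean_y = sum(s) / n, sum(ys) / n
    var = sum((v - mean_s) ** 2 for v in s)
    cov = sum((v - mean_s) * (y - mean_y) for v, y in zip(s, ys))
    slope = cov / var if var else 0.0
    return lambda x: mean_y + slope * (sum(x) - mean_s)


# Hypothetical registry of candidate algorithms.
ALGORITHMS = {"mean": fit_mean, "linear": fit_linear}


def training_score(model, xs, ys):
    """Negative mean squared error, so a larger score is better."""
    return -sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


def automl_search(rows, targets, algorithms, n_features):
    """Try every (feature selection, algorithm) pair and return the best."""
    results = []
    for k in range(1, n_features + 1):
        for cols in combinations(range(n_features), k):
            xs = [tuple(r[c] for c in cols) for r in rows]
            for name, fit in algorithms.items():
                model = fit(xs, targets)
                score = training_score(model, xs, targets)
                results.append((score, name, cols))
    return max(results)
```

In this sketch each iteration produces one model and one score, and the highest-scoring tuple of score, algorithm name, and selected feature columns is returned. A practical AutoML service would typically evaluate the pairs in parallel and score on held-out data.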
Automated featurization can help drive AutoML by automatically identifying, capturing, and leveraging relevant, high-quality feature data for model training, testing, validation, etc. For example, featurization automation can assist an ML algorithm to learn better, forming an improved ML model for deployment and utilization. Featurization can include feature normalization, missing data identification and interpolation, format conversion and/or scaling, etc. Featurization can transform raw data into a set of features to be consumed by a training algorithm, etc. In certain examples, featurization is based on an analysis of an underlying platform (e.g., hardware, software, firmware, etc.) on which a model will be operating.
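As one hedged illustration of the featurization operations named above, the following sketch fills missing values by linear interpolation and then applies min-max scaling. The function names `interpolate_missing`, `min_max_scale`, and `featurize` are hypothetical, and the choice of linear interpolation and min-max scaling is an assumption; other interpolation and normalization schemes could be used instead.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between known neighbors.

    Leading or trailing gaps are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no known values to interpolate from")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = values[left] + frac * (values[right] - values[left])
    return out


def min_max_scale(values):
    """Scale values into the range [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def featurize(raw):
    """Transform a raw column into a feature column for training."""
    return min_max_scale(interpolate_missing(raw))
```

For example, `featurize([10.0, None, 30.0, None])` interpolates the first gap to 20.0, carries 30.0 forward into the trailing gap, and then scales the column to `[0.0, 0.5, 1.0, 1.0]`.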
Automated featurization provides important benefits in an edge computing environment, an Internet of Things (IoT) environment (which may be an edge computing environment or other network infrastructure, etc.), cloud computing environment, etc. In such environments, resource capabilities can vary greatly, and providing an ability for quick, automated featurization of available resources enables the automation of AI algorithm/model optimization on edge/IoT devices, etc. For example, an IoT device, such as a user device, a cloud server rack, etc., can include a system for automated featurization. Such a system can be provided in silicon as part of a platform providing a software and AI framework, for example.
To date, efforts to scale deployment of AI algorithms on computing platforms have been manual, involving a massive human effort. Such manual effort also renders impossible large scale optimization across multiple customers, multiple models, and multiple products. As described herein, automated featurization addresses these challenges and enables hardware-aware neural architecture exploration based on automated featurization of underlying platform characteristics for a given workload. As such, new AI features can be discovered, and machine-generated AI algorithms can be optimized for particular hardware and/or software platforms. Certain examples provide automated featurization that can scale across large data sets, diverse data sets, and many use cases. Certain examples enable design of AI algorithm and hardware for rapid discovery of new AI features and design of AI silicon.
Certain examples provide hardware-aware AutoML, which leverages one or more underlying hardware platforms or architectures when training ML and/or other learning network models. To enable AutoML, a featurization search is conducted to identify hardware (and software and/or firmware) characteristics that can be converted to features for training, testing, etc., of an AutoML model. A challenge associated with hardware-aware AutoML is exposing underlying hardware capability to a featurization search. In certain examples, hardware characteristics are identified and input into the automated featurization search. The data set formed for training and/or testing the ML network model can then be more focused on the platform on which the model is to be deployed.
In certain examples, a featurization search is conducted to analyze a workload associated with (e.g., being executed using) a hardware platform. For example, operations such as latency, energy, memory bandwidth, other resource utilization, etc., can be measured in a search space to help identify features of the hardware platform. Rather than manual measurement of operations, which is very resource intensive and time consuming, if not impossible, a search space for a hardware platform can be defined in terms of blocks (also referred to as code blocks) and cells for a given workload. Portions of the underlying hardware can be exposed in connection with specified code blocks. Microcode hooks and telemetry information from the underlying hardware platform can be used to automatically gather data associated with code blocks in the featurization search.
In certain examples, “mile markers” or other indicators are associated with a start and/or end of a code block, other microcode benchmark, etc. Values (e.g., parameter values, register values, code execution status values, etc.) captured at mile markers can be saved in a database and/or other data store to form a basis for automated hardware-aware neural architecture search and network training using identified features, for example. For example, during a warm-up phase in which search and exploration of an AI algorithm begins to be optimized and/or otherwise tailored for an underlying platform, mile markers can be specified to allow the underlying hardware to collect statistics such as number of cycles required by an associated code block, an amount of data movement involved, etc. After statistics have been gathered and saved (e.g., in a database) in the warm-up phase, the statistics can be used to generate (e.g., train, test, validate, etc.) an ML or other AI network that is optimized and/or otherwise tailored for the underlying hardware platform on which the network will execute, for example. As such, network model accuracy and effectiveness can be improved while reducing front end overhead from a manual search or training process.
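The warm-up phase described above can be approximated in software, for illustration only, as follows. This is a sketch under stated assumptions: a software clock (`time.perf_counter_ns`) stands in for the hardware telemetry and microcode hooks of the disclosure, an in-memory dictionary stands in for the database, and the names `mile_marker` and `run_warm_up` are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def mile_marker(block_name, db, clock=time.perf_counter_ns):
    """Capture marker values at the start and end of a code block.

    Assumption: a software clock stands in for hardware telemetry.
    """
    start = clock()
    try:
        yield
    finally:
        end = clock()
        # Save the captured values so a later search phase can use them.
        db[block_name].append(
            {"start": start, "end": end, "elapsed": end - start}
        )


def run_warm_up(blocks, db=None):
    """Execute each named code block once and collect its statistics."""
    db = defaultdict(list) if db is None else db
    for name, fn in blocks.items():
        with mile_marker(name, db):
            fn()
    return db
```

A call such as `run_warm_up({"sum_block": lambda: sum(range(1000))})` returns a mapping from block name to the list of captured start, end, and elapsed values. The `clock` parameter is injectable so that a simulated counter can be substituted for the real one.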
In certain examples, mile markers associated with the code or compute blocks trigger microcode executing with respect to the hardware underneath the code block to capture telemetry and code execution statistics with respect to the underlying hardware. The statistics can be “frozen” and saved (e.g., in a database, other data store, etc.). The statistics can be made available for a neural network architecture search, etc. In certain examples, as information is being captured, the statistics/information can be stored in a “global” database that includes statistics from multiple hardware platforms. Information in the global database can enable cross-platform model optimization, exploration for future hardware, etc.
Mile markers provide a flexible, dynamic ability to explore and evaluate an underlying hardware platform with or without one or more workloads for execution. In certain examples, mile markers can be automatically placed according to one or more criteria (e.g., a configurable policy, one or more parameters, etc.) and can be automatically identified, captured, and saved for further use in neural architecture exploration, for example. For example, rather than running a full workload, an automated featurization search process can run mile markers and collect information from those mile markers to compare different configurations and/or different architectures as a proxy for estimating data movement, latency, and/or other characteristics from workload execution on the underlying hardware platform. If a mile marker is known, the marker can be associated with the position of known micro-operation benchmarks such as memory access, etc., not only hardware. As such, the benchmark can be leveraged without executing the mile marker, for example. In other examples, a mile marker can be simulated (e.g., in association with a convolution, etc.) to obtain a value of the mile marker without actually running the underlying hardware. As such, during a training and exploration phase, mile markers can be used in a variety of ways to obtain hardware events from running, simulating, etc., the mile marker. Depending on how a mile marker is leveraged, exploration can run faster, have higher fidelity, etc.
In certain examples, a mile marker can provide a specific micro benchmark exploration. A marker can be added to understand a certain hardware characteristic. In certain examples, a hierarchy of micro-operation levels can be constructed, and mile markers can be examined to determine how the mile markers impact the hierarchy. As such, mile markers can be used to understand what is happening underneath the microarchitecture pipeline.
In certain examples, mile markers can be leveraged for more than a feature or characteristic analysis of an underlying platform. For example, mile markers can be used to evaluate microcode patches for deployment across one or more hardware platforms. Telemetry from mile markers can help to tune and optimize microcode deployment by leveraging platform telemetry statistics. Extensions can help to find an optimal search space for microcode patches (e.g., can be customized per platform rather than one-size-fits-all approach), for example.
Alternatively or additionally, mile markers can be leveraged across platform elements and protocols. For example, mile markers can be extended from microcode patch efficiency monitoring and tuning to correspond to other logic blocks in a platform running firmware. Mile markers corresponding to a start and/or end of such logic blocks can be leveraged to tune different platform elements, protocols, etc.
Captured information and associated analysis can be saved in a cloud-accessible data store, made available to local applications, etc., to effect change(s) in local and remote platform behavior. For example, information can be exposed to applications running locally on a hardware platform to effect a change in and/or otherwise modify platform behavior and/or application behavior via the local application(s). In certain examples, characterization and tuning of individual logic blocks, groups of logic blocks, an entire platform, etc., can be crowd-sourced across a fleet or cluster of servers from one or more data centers in a hyper-cloud deployment for workload deployment and fleet management at scale.
FIG. ID4_A is a block diagram of an example AutoML apparatus ID4_A00 including an example featurization search system ID4_A10, an example hardware platform ID4_A20, and an example model developer circuitry ID4_A30. The example featurization search system ID4_A10 analyzes the example hardware platform ID4_A20, which includes hardware, software, and firmware, to determine features associated with the example platform ID4_A20. The example featurization search system ID4_A10 divides and/or otherwise organizes the hardware and software of the example platform ID4_A20 into blocks. Mile markers or event indicators (e.g., code start, code end, microcode benchmark, etc.) are captured by the example featurization search system ID4_A10 examining the configuration and activity of the example platform ID4_A20. The example featurization search system ID4_A10 saves the captured mile marker values and other associated data, analytics, etc., to form features associated with the example platform ID4_A20. Features can be organized into one or more data sets to drive development of an AI model (e.g., for training, testing, etc., of an ML network and/or other AI model), for example. The features can serve as a basis for automated hardware-aware neural architecture search and network training using the identified features saved, aggregated, normalized, and/or otherwise processed as one or more data sets, for example. In the example of FIG. ID4_A, features and other extracted and/or evaluated information can be used by the example model developer circuitry ID4_A30 to generate, train, test, and deploy AI models, for example. For example, identified features and associated telemetry data can be used to form one or more data sets for training, testing, and/or other validation of an AI model construct (e.g., AI model development).
FIG. ID4_B illustrates an example implementation of the example featurization search system ID4_A10 of FIG. ID4_A. The example featurization search system ID4_A10 includes an example block analyzer circuitry ID4_B10, an example architecture searcher circuitry ID4_B20, an example network trainer circuitry ID4_B30, an example marker capturer circuitry ID4_B40, and an example database ID4_B50. The example block analyzer circuitry ID4_B10 defines an example feature search space of a workload associated with the example hardware platform circuitry ID4_A20, for example, in terms of code or computation blocks or cells. A beginning and/or end of a block can be associated with a marker (e.g., a “mile marker”) or other indicator associated with an action or state (e.g., a hardware, platform, or code state, etc.) at a beginning or end of execution of the associated code block by the underlying processing hardware of the example platform circuitry ID4_A20. The example block analyzer circuitry ID4_B10 communicates with the example architecture searcher circuitry ID4_B20 to provide the code blocks and associated markers during a warm-up phase of architecture search with respect to the example hardware platform ID4_A20.
The example architecture searcher circuitry ID4_B20 works with the example marker capturer ID4_B40 and the example network trainer circuitry ID4_B30 to capture values associated with markers as the code blocks are executed in conjunction with training of a network (e.g., a ML network model, neural network model, other AI model, etc.) by the example network trainer circuitry ID4_B30 during an initial, warm-up phase of architecture exploration and training. For example, the architecture searcher circuitry ID4_B20 monitors execution of software code, microcode, etc., on the hardware platform ID4_A20 and/or facilitates simulation of software code, microcode, etc., with respect to the example platform ID4_A20. As the architecture searcher ID4_B20 monitors real and/or simulated execution to evaluate the platform ID4_A20 with respect to the network, the example marker capturer ID4_B40 captures markers and/or other indicators associated with a hardware and/or software state at a start and/or end of the associated code block and/or microcode/micro-operation benchmark, etc. Captured values can be saved in the example database ID4_B50 to drive a next stage or phase of neural architecture exploration using the example architecture searcher circuitry ID4_B20 and the example network trainer circuitry ID4_B30. For example, captured values can include a hash key (or embedding) of a micro block/operation along with input/output dimensions used as an index to an entry in the database ID4_B50 to store parameters such as latency, memory footprint, power, etc. Using information extracted at the marker(s) reveals characteristics of the underlying hardware and allows the example architecture searcher circuitry ID4_B20 to be hardware aware in its architecture search and training of an ML and/or other AI network using the example network trainer circuitry ID4_B30, for example.
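One possible way to index database entries by a hash key of a micro block/operation together with its input/output dimensions is sketched below. This is an illustrative assumption rather than the disclosed database design: the `block_key` function, the `MarkerStore` class, the use of SHA-256 over a JSON encoding, and the field names `latency_us`, `memory_bytes`, and `power_w` are all hypothetical.

```python
import hashlib
import json


def block_key(op_name, input_dims, output_dims):
    """Derive a stable hash key from an operation and its I/O dimensions."""
    payload = json.dumps(
        {"op": op_name, "in": list(input_dims), "out": list(output_dims)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MarkerStore:
    """In-memory stand-in for a database of captured marker values."""

    def __init__(self):
        self._entries = {}

    def put(self, op_name, input_dims, output_dims,
            latency_us, memory_bytes, power_w):
        """Store parameters under the key of the operation and its shapes."""
        key = block_key(op_name, input_dims, output_dims)
        self._entries[key] = {
            "latency_us": latency_us,
            "memory_bytes": memory_bytes,
            "power_w": power_w,
        }
        return key

    def get(self, op_name, input_dims, output_dims):
        """Return the stored parameters, or None if nothing was captured."""
        return self._entries.get(block_key(op_name, input_dims, output_dims))
```

Because the key depends only on the operation name and the dimensions, an architecture search can look up previously captured latency, memory footprint, and power for a candidate block without executing it again.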
As such, in operation during a warm-up phase, the example architecture searcher circuitry ID4_B20 begins optimizing or otherwise improving an AI algorithm for an underlying platform in conjunction with the example network trainer circuitry ID4_B30. The example marker capturer circuitry ID4_B40 measures and/or otherwise captures hardware statistics associated with software program code execution (e.g., in conjunction with microcode evaluation and capture, etc.), such as a number of cycles associated with execution of the code block, an amount of data movement associated with execution of the code block, etc. In certain examples, code (e.g., microcode, program code, etc.) associated with a mile marker can be simulated, rather than actually executed on the underlying platform ID4_A20. The gathered statistics are saved by the example marker capturer ID4_B40 in the example database ID4_B50 to be used, alone or in conjunction with data from the same and/or other hardware platform(s), in an exploration phase to develop an ML and/or other hardware-aware AI architecture by incorporating hardware features from the database ID4_B50 via the example architecture searcher ID4_B20 and the example network trainer circuitry ID4_B30 to identify and train an AutoML and/or other AI network with respect to the underlying hardware platform ID4_A20 on which the network will execute, for example. The marker data in the database ID4_B50 can be used to form or drive a feature engine, tailored to the platform ID4_A20, to model features for analysis with respect to the platform ID4_A20, for example.
FIG. ID4_C is an implementation of the example hardware platform ID4_A20 on which the example featurization search system ID4_A10 of FIGS. ID4_A-ID4_B operates and/or can be implemented. The example hardware platform ID4_A20 includes one or more applications ID4_C10, virtual machines (VMs) ID4_C12, an operating system (OS)/virtual memory manager (VMM) ID4_C14 executing with respect to a unified extensible firmware interface (UEFI)/basic input/output system (BIOS) ID4_C16 in a software layer ID4_C20 of the example hardware computing platform ID4_A20 of FIG. ID4_B. As shown in the example of FIG. ID4_C, the example UEFI/BIOS ID4_C16 includes a microcode update manager ID4_C18. The example microcode update manager ID4_C18 works with an example instruction set architecture (ISA) manager circuitry ID4_C30 to configure specific ISA behavior, evaluate ISA configuration, and take actions based on associated policy.
As shown in the example of FIG. ID4_C, a hardware layer ID4_C40 includes a system-on-a-chip (SoC) and/or other processor/processing hardware ID4_C50. The example SoC/hardware ID4_C50 includes microcode ID4_C52 and secure microcode ID4_C54 to facilitate hardware operations of the example hardware circuitry ID4_C50. The example microcode ID4_C52 and secure microcode ID4_C54 interpose an organization layer between the hardware ID4_C50 and the ISA of the example processing platform ID4_A20 of FIG. ID4_C.
The example ISA manager ID4_C30 can be used to implement all or part of the example marker capturer circuitry ID4_B40 of FIG. ID4_B. While the example of FIG. ID4_C shows the example ISA manager circuitry ID4_C30 implemented as part of the example platform ID4_A20, all or part of the ISA manager circuitry ID4_C30 (e.g., at least the marker capturer circuitry ID4_B40, etc.) can be implemented as part of the featurization search system ID4_A10, etc. As shown in the example of FIG. ID4_C, the example ISA manager circuitry ID4_C30 implements the example marker capturer circuitry ID4_B40 using an example telemetry manager ID4_C32, an example micro-operation (UOP) surplus mapper ID4_C34, an example ISA evaluator ID4_C36, and an example ISA decoder ID4_C38. The example ISA manager ID4_C30 communicates via a network ID4_C60 with a cloud-based server ID4_C70 and/or other computing device. As such, the example marker capturer circuitry ID4_B40 can provide captured marker data to the example server ID4_C70 via the example network ID4_C60. An external device can access the marker data via the example server ID4_C70, for example.
As shown in the example of FIG. ID4_C, the example telemetry manager ID4_C32 can capture one or more key performance indicators through interaction with the example microcode update manager ID4_C18 in the software (SW) layer ID4_C20 monitoring microcode ID4_C52, ID4_C54 in the hardware (HW) layer ID4_C40. The example UOP-surplus mapper ID4_C34 generates a new ISA and/or mile marker for execution by the microcode ID4_C52, ID4_C54. The example ISA evaluator ID4_C36 dispatches the ISA and/or associated mile marker and evaluates information associated with a state of the hardware and/or software at the mile marker. The example ISA decoder ID4_C38 processes the ISA and/or mile marker information for inclusion in the example database ID4_B50.
As such, certain examples provide automated performance monitoring for an example hardware base ID4_C40 and associated software ID4_C20. Example mile markers and associated data can be captured against multiple platforms to create an offline and/or online database ID4_B50 to enable cross-platform model optimization, for example. The example microcode dispatcher ID4_C52 and/or ID4_C54 can execute and/or capture information with respect to one or more mile markers and convey information to the example microcode update manager ID4_C18 to be provided to the example ISA manager circuitry ID4_C30. Latency, energy, bandwidth, etc., can be estimated from one or more markers/motifs using composition of micro-benchmarks running on the example platform ID4_A20, performance modeling or simulation based on mile markers, etc.
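As a simplified illustration of estimating latency by composition of micro-benchmarks, the sketch below sums per-block benchmark latencies weighted by how often each block occurs in a workload. The additive model is an assumption made for clarity: it ignores overlap, pipelining, and contention between blocks, and the function name `estimate_latency_us` and the example block names are hypothetical.

```python
def estimate_latency_us(workload, benchmarks):
    """Estimate workload latency from per-block micro-benchmark latencies.

    workload:   mapping of block name to number of occurrences
    benchmarks: mapping of block name to measured latency in microseconds

    Assumption: block latencies add linearly, with no overlap between blocks.
    """
    missing = [block for block in workload if block not in benchmarks]
    if missing:
        raise KeyError(f"no micro-benchmark for blocks: {sorted(missing)}")
    return sum(count * benchmarks[block] for block, count in workload.items())


def pick_faster(candidates, benchmarks):
    """Return the name of the candidate workload with the lowest estimate."""
    return min(
        candidates,
        key=lambda name: estimate_latency_us(candidates[name], benchmarks),
    )
```

With benchmark latencies of 120.0, 5.0, and 300.0 microseconds for hypothetical `conv3x3`, `relu`, and `matmul` blocks, a workload of four `conv3x3`, four `relu`, and one `matmul` is estimated at 800.0 microseconds without executing the full workload.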
Data associated with the captured/simulated mile markers, stored in the example database ID4_B50 and/or used in real-time, is exposed to one or more ML/DL frameworks to optimize a resulting network, such as an AutoML network being trained by the example network trainer circuitry ID4_B30, etc. In certain examples, mile marker data collected online can be merged or fused with data collected offline (e.g., data fusion for time-series data, etc.). As such, a model can continue to reflect actual workloads even after the platform has been deployed.
In certain examples, the database ID4_B50 stores data from mile markers captured across a plurality of hardware and/or software versions/generations, configurations/form factors of a same version/generation, etc. For example, mile marker data collected from a twenty core part, a fifty-six core part, a one hundred fifty watt thermal design power (TDP) tool, etc., can be stored, leveraged, and shared via the database ID4_B50. In certain examples, the cloud server ID4_C70 can leverage telemetry and mile marker data from the database ID4_B50 (e.g., implemented as a cloud-accessible database) to identify one or more bottlenecks such as a compute bottleneck, a memory bottleneck, an interconnect bottleneck, a communication bottleneck, etc. Lessons learned from one deployment can be fed forward so that future neural network searches can learn from prior data collection and analysis, for example. Mile markers can be augmented across attachment points, accelerators, etc. Mile markers can be provided across accelerator attachment points, for example.
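Bottleneck identification from telemetry can be illustrated, under simplifying assumptions, by comparing observed utilization of each resource against its capacity. The threshold of 0.9, the resource names, and the function name `find_bottlenecks` are hypothetical and serve only to show the idea; an actual analysis could use richer models.

```python
def find_bottlenecks(observed, capacity, threshold=0.9):
    """Return resources whose utilization ratio meets the threshold.

    observed: mapping of resource name to measured usage
    capacity: mapping of resource name to maximum usage (same units)

    Results are ordered from most to least saturated.
    """
    ratios = {}
    for resource, value in observed.items():
        limit = capacity[resource]
        if limit <= 0:
            raise ValueError(f"capacity for {resource!r} must be positive")
        ratios[resource] = value / limit
    flagged = [r for r in ratios if ratios[r] >= threshold]
    return sorted(flagged, key=lambda r: ratios[r], reverse=True)
```

For instance, with compute at 40 of 100 units, memory bandwidth at 95 of 100 units, and interconnect at 29 of 32 units, the memory and interconnect resources are reported as bottlenecks, in that order.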
For example, using Compute Express Link™ (CXL) interconnect technologies, mile marker data collection and analysis can be scaled to other attachment points. Accelerator attachment points and communication protocols can be analyzed and determined. Elements of a heterogeneous architecture can be examined to evaluate a change in mile markers based on a change in accelerator attachment points, communication protocols, central processing unit (CPU) usage, graphics processing unit (GPU) usage, etc. Telemetry can be exposed from the database ID4_B50 to the cloud server ID4_C70 and/or to one or more applications ID4_C10, etc., running on the example platform ID4_A20. As such, both local applications ID4_C10 and remote systems can benefit from the mile marker capture, telemetry analysis, etc., to enable change based on platform ID4_A20 behavior, for example.
In certain examples, microcode patches can be evaluated for deployment. Telemetry from mile markers can be used to help tune and improve microcode deployment by leveraging the platform telemetry statistics observed and shared with the AutoML framework via the database ID4_B50. AutoML framework extensions can help to find a beneficial search space for microcode patches. For example, the microcode search space can be customized per platform ID4_A20 rather than a one-size-fits-all approach.
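The per-platform customization described above can be sketched as a selection, for each platform, of the candidate patch with the most favorable telemetry. This is a hypothetical illustration: the function name `select_patch_per_platform`, the patch and platform identifiers, and the use of a single latency metric as the selection criterion are assumptions, and a real evaluation would likely weigh several telemetry statistics.

```python
def select_patch_per_platform(measurements):
    """Choose, for each platform, the patch with the lowest measured latency.

    measurements: mapping of platform name to a mapping of
                  patch identifier to latency in microseconds
    """
    selection = {}
    for platform, by_patch in measurements.items():
        if not by_patch:
            raise ValueError(f"no patch measurements for {platform!r}")
        selection[platform] = min(by_patch, key=by_patch.get)
    return selection
```

Given measurements in which a hypothetical `patch_2` is faster on one platform and `patch_1` is faster on another, the function returns a different patch for each platform instead of a single patch for the whole fleet.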
The example microcode update manager ID4_C18 can also be used to validate a new microcode patch and/or download to the platform ID4_A20, as well as perform mile marker identification and capture. For example, a new microcode patch is evaluated using telemetry and performance hooks exposed for the mile marker analysis.
In certain examples, mile markers are extended across platform elements and protocols. Mile markers can be extended from processor microcode patch efficiency monitoring and tuning to evaluate other blocks in the platform ID4_A20 running their own firmware, for example. Information can be exposed to applications running locally, as well as the cloud ID4_C70, to effect change in (e.g., modify) platform ID4_A20 behavior and/or application ID4_C10 behavior via the local application(s) ID4_C10, for example. In certain examples, characterizing and tuning of individual components, as well as the full platform ID4_A20, can be crowd-sourced across a fleet or cluster of servers from one or more data centers in a hyper-cloud deployment for workload deployment and fleet management at scale.
FIG. ID4_D illustrates an example implementation in which crowd-sourced analytics are tracked across crowd-sourced platforms. As discussed above, information can be exposed to applications running locally, as well as in “the cloud”, to effect a change in platform behavior via the local application(s). In certain examples, characterization and tuning of individual components, as well as platform-level settings, behavior, etc., can be crowd-sourced across a fleet or cluster of servers from one or many data centers in a hyper-cloud deployment for workload deployment and fleet management at scale.
FIG. ID4_D illustrates an example crowd-sourced deployment ID4_D00 in which a plurality of hardware platforms ID4_A20, ID4_D10, ID4_D15 provide analytics (e.g., content, device data, connectivity information, processor information (e.g., CPUs, XPUs, etc.), etc.) via the network ID4_C60 to the cloud server ID4_C70 configured as a crowd-sourced cloud server ID4_C70. The crowd-sourced cloud server ID4_C70 provides analytics for a mile marker (e.g., TEE) or set of mile markers such as patch evaluation/characterization, mile marker telemetry analytics, content analytics, secure timer, device attributes, policy management, etc., in a secure storage on the crowd-sourced cloud server ID4_C70, for example. As such, data corresponding to a mile marker can be gathered and aggregated from multiple platforms ID4_A20, ID4_D10, ID4_D20, and analyzed for storage and further use via the cloud server ID4_C70.
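Gathering and aggregating mile marker data from multiple platforms can be illustrated with the following sketch, in which a server-side routine groups reports by mile marker and summarizes them. The report fields (`platform`, `marker`, `latency_us`), the function name `aggregate_reports`, and the chosen summary statistics are hypothetical assumptions; network transport and secure storage are omitted.

```python
from collections import defaultdict
from statistics import mean


def aggregate_reports(reports):
    """Group per-platform marker reports by marker and summarize them.

    reports: iterable of dicts with keys "platform", "marker", "latency_us"
    """
    by_marker = defaultdict(list)
    for report in reports:
        by_marker[report["marker"]].append(report)

    summary = {}
    for marker, rows in by_marker.items():
        latencies = [row["latency_us"] for row in rows]
        summary[marker] = {
            "platforms": sorted({row["platform"] for row in rows}),
            "samples": len(rows),
            "mean_latency_us": mean(latencies),
            "max_latency_us": max(latencies),
        }
    return summary
```

Each entry in the returned summary lists the contributing platforms and simple statistics for one mile marker, which a crowd-sourced server could then store or feed into further analysis.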
While an example manner of implementing the portion of the example featurization search system ID4_A10 of FIG. ID4_A is illustrated in FIGS. ID4_B-ID4_D, one or more of the elements, processes and/or devices illustrated in FIGS. ID4_B-ID4_D may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example block analyzer circuitry ID4_B10, the example architecture searcher circuitry ID4_B20, the example network trainer circuitry ID4_B30, the example marker capturer circuitry ID4_B40, the example database ID4_B50, the example software layer ID4_C20, the example hardware layer ID4_C40, the example processing hardware ID4_C50, the example ISA manager circuitry ID4_C30, the example network ID4_C60, the example server ID4_C70, and/or more generally, the example featurization search system ID4_A10 of FIGS. ID4_A-ID4_D may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Thus, for example, any of the example block analyzer circuitry ID4_B10, the example architecture searcher circuitry ID4_B20, the example network trainer circuitry ID4_B30, the example marker capturer circuitry ID4_B40, the example database ID4_B50, the example software layer ID4_C20, the example hardware layer ID4_C40, the example processing hardware ID4_C50, the example ISA manager circuitry ID4_C30, the example network ID4_C60, the example server ID4_C70, and/or more generally, the example featurization search system ID4_A10 of FIGS. ID4_A-ID4_D could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), programmable controller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)) and/or field programmable logic device(s) (FPLD(s)).
When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example block analyzer circuitry ID4_B10, the example architecture searcher circuitry ID4_B20, the example network trainer circuitry ID4_B30, the example marker capturer circuitry ID4_B40, the example database ID4_B50, the example software layer ID4_C20, the example hardware layer ID4_C40, the example processing hardware ID4_C50, the example ISA manager circuitry ID4_C30, the example network ID4_C60, and the example server ID4_C70 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc. including the software and/or firmware. Further still, the example block analyzer ID4_B10, the example architecture searcher ID4_B20, the example network trainer ID4_B30, the example marker capturer ID4_B40, the example database ID4_B50, the example software layer ID4_C20, the example hardware layer ID4_C40, the example processing hardware ID4_C50, the example ISA manager circuitry ID4_C30, the example network ID4_C60, the example server ID4_C70, and/or more generally, the example featurization search system ID4_A10 of FIGS. ID4_A-ID4_D may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIGS. ID4_B-ID4_D, and/or may include more than one of any or all of the illustrated elements, processes and devices.
One or more of the elements, processes, and/or devices described above can be implemented using processor circuitry including at least one of a) a central processing unit, a graphics processing unit or a digital signal processor; b) a Field Programmable Gate Array (FPGA); or c) an Application Specific Integrated Circuit (ASIC). In such implementations, the at least one of the central processing unit, the graphics processing unit or the digital signal processor has control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus. In such implementations, the FPGA includes logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations. In such implementations, the ASIC includes logic gate circuitry to perform one or more third operations. The processor circuitry is to perform at least one of the first operations, the second operations or the third operations to execute processes, functions, and/or implement elements or devices described above.
In certain examples, one or more of the elements, processes, and/or devices described above can be implemented using an apparatus including at least one memory; instructions in the apparatus; and processor circuitry including control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more operations on the data, and one or more registers to store a result of one or more of the operations, the processor circuitry to execute the instructions to implement one or more of the elements, processes, and/or devices described above.
Flowcharts representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing all or part of the example featurization search system ID4_A10 of FIG. ID4_A are shown in FIGS. ID4_E-ID4_G. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry. The programs may be embodied in software stored on a non-transitory computer readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor, but the entire program and/or parts thereof could alternatively be executed by a device other than the processor and/or embodied in firmware or dedicated hardware. Further, although the example programs are described with reference to the flowcharts illustrated in FIGS. ID4_E-ID4_G, many other methods of implementing the example portion of the example featurization search system ID4_A10 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware. The processor circuitry may be distributed in different network locations and/or local to one or more devices (e.g., a multi-core processor in a single machine, multiple processors distributed across a server rack, etc.).
As mentioned above, the example processes of FIGS. ID4_E-ID4_G may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
In certain examples, a marker capturing means can be implemented by the example marker capturer circuitry ID4_B40 to extract information regarding the hardware platform ID4_C50 associated with a marker occurring during a first phase (e.g., the warm-up phase). The example marker capturing means is to store the information in a database, such as the example database ID4_A50. In certain examples, an architecture searching means can be implemented by the example architecture searcher circuitry ID4_B20 to, in the first phase, monitor a hardware platform to identify features associated with the marker to train an artificial intelligence model and, in a second phase (e.g., the exploration phase) to execute hardware-aware training of the artificial intelligence model using information from the database.
FIG. ID4_E is a flowchart representative of machine readable instructions which may be executed to implement all or part of the example AutoML apparatus ID4_A00 of FIG. ID4_A. In the example process ID4_E00 of FIG. ID4_E, the example featurization search system ID4_A10 analyzes the example hardware platform ID4_A20 to determine features associated with the example platform ID4_A20. (Block ID4_E10). For example, the example featurization search system ID4_A10 divides and/or otherwise organizes the hardware and software of the example platform ID4_A20 into blocks. Mile markers or event indicators (e.g., code start, code end, microcode benchmark, etc.) are captured by the example featurization search system ID4_A10 examining the configuration and activity of the example platform ID4_A20. The example featurization search system ID4_A10 saves the captured mile marker values and other associated data, analytics, etc., to form features associated with the example platform ID4_A20. Features can be organized into one or more data sets for training, testing, etc., of an ML network and/or other AI model, for example. Features can also be organized by architecture (e.g., CPU, GPU, FPGA, vision processing unit (VPU), XPU, etc.), and/or by platform capability (e.g., different generation processors have different capabilities, etc.), etc. The features are provided for use via the example database ID4_B50 (Block ID4_E20) and can serve as a basis for automated hardware-aware neural architecture search and network training using the identified features saved, aggregated, normalized, and/or otherwise processed as one or more data sets, for example.
The example model developer circuitry ID4_A30 uses features and other extracted and/or evaluated information to generate, train, test, and/or otherwise validate one or more AI models. (Block ID4_E30). For example, identified features and associated telemetry data can be used to form one or more data sets for training, testing, and/or other validation of an AI model construct. The AI model construct is customized for an underlying architecture and configuration based on the features and associated data, for example. In certain examples, the AI model construct is customized for data processing in addition to compute optimization and precision selection (e.g., FP32, INT8, INT4, other operations, etc.). The AI model construct can then be deployed on the example platform ID4_A20. (Block ID4_E40).
FIG. ID4_F is a flowchart representative of machine readable instructions which may be executed to implement all or part of the example featurization search system of FIGS. ID4_A-ID4_D to analyze the example platform ID4_A20 to determine features associated with the example platform ID4_A20. (Block ID4_E10 of the example of FIG. ID4_E). In the example process ID4_F00 of FIG. ID4_F, the example block analyzer ID4_B10 defines a featurization search space in terms of blocks or cells. (Block ID4_F10). Blocks/cells can be associated with markers or “hooks” (e.g., a custom code or function executed at a start of a code block, at an end of a code block, at a point of execution within the code block, etc., to provide a parameter value, status value, other data to the marker capturer ID4_B40) to enable telemetry information to be extracted with respect to the blocks or cells, for example. The example block analyzer ID4_B10 uses the blocks or cells as code blocks for monitoring. (Block ID4_F20). During a warm-up phase, the example architecture searcher circuitry ID4_B20 searches for features with respect to the code blocks. (Block ID4_F30). For example, the architecture searcher circuitry ID4_B20 monitors code to identify features that can be used to form data set(s) for training, testing, and/or validating an AI network model. Code blocks are examined while the example network trainer circuitry ID4_B30 is training the AI network model for accuracy in the warm-up phase. (Block ID4_F40). The example marker capturer circuitry ID4_B40 captures mile marker and/or other indicator values. (Block ID4_F50). For example, telemetry and/or other data associated with a start of a compute block, an end of a compute block, an execution point within a compute block, etc., can be collected by the example marker capturer circuitry ID4_B40 to determine a number of cells, code movement, parameter values, etc.
In certain examples, the marker capturer circuitry ID4_B40 interacts with the microcode ID4_C52, ID4_C54 to capture data values, usage statistics, etc., for the example platform ID4_A20. After processing the values to format the values, analyze the values, normalize the values, etc., the values/statistics are saved in the example database ID4_B50. (Block ID4_F60). Statistics are collected and updated for code blocks and saved in the example database ID4_B50 for a number n epochs of the warm-up phase, for example. (Block ID4_F70). In certain examples, code blocks can be simulated, rather than actually executed, to gather marker data for processing, storage, and usage.
Operation then shifts to a neural architecture exploration phase, in which the example architecture searcher circuitry ID4_B20 conducts a hardware-aware featurization search based on the hardware information saved by the example marker capturer circuitry ID4_B40 in the example database ID4_B50. (Block ID4_F80). In certain examples, the database ID4_B50 includes marker information from a plurality of hardware captures, etc. The example network trainer circuitry ID4_B30 can then train a network model (e.g., an AutoML network model, another ML network model, a DL network model, etc.) based on the hardware-aware featurization search of the architecture searcher circuitry ID4_B20 (e.g., results stored in the database ID4_B50) to provide a trained network model for testing and deployment. (Block ID4_F90).
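For illustration only, the two-phase flow described above (marker capture during a warm-up phase, followed by a hardware-aware selection that consults the stored values) can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names new_marker_db, mile_marker, warm_up, mean_latency, and hardware_aware_choice are hypothetical, an in-memory dictionary stands in for the example database ID4_B50, and wall-clock latency stands in for the richer telemetry that a marker may report.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


def new_marker_db():
    """Hypothetical in-memory stand-in for the marker database."""
    return defaultdict(list)


@contextmanager
def mile_marker(db, block_name, platform="cpu"):
    """Hook that fires at the start and end of a code block and stores one record."""
    start = time.perf_counter()
    try:
        yield
    finally:
        db[block_name].append(
            {"platform": platform, "latency_s": time.perf_counter() - start}
        )


def warm_up(db, blocks, epochs):
    """Warm-up phase: run every code block for n epochs with markers attached."""
    for _ in range(epochs):
        for name, fn in blocks.items():
            with mile_marker(db, name):
                fn()


def mean_latency(db):
    """Aggregate the captured marker values per code block."""
    return {
        name: sum(r["latency_s"] for r in records) / len(records)
        for name, records in db.items()
    }


def hardware_aware_choice(db, candidates):
    """Exploration phase: prefer the candidate block with the lowest measured latency."""
    stats = mean_latency(db)
    return min(candidates, key=lambda name: stats[name])


# Usage: two hypothetical candidate blocks measured over three warm-up epochs.
db = new_marker_db()
blocks = {
    "wide_layer": lambda: sum(i * i for i in range(20000)),
    "narrow_layer": lambda: sum(i * i for i in range(200)),
}
warm_up(db, blocks, epochs=3)
choice = hardware_aware_choice(db, list(blocks))
```

In this sketch the search strategy is reduced to a single minimum over measured latency; an actual architecture search could weigh accuracy, efficiency, and other statistics drawn from the same records.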
FIG. ID4_G is a flowchart representative of machine readable instructions which may be executed to implement all or part of the example featurization search system of FIGS. ID4_A-ID4_C. In particular, the example process ID4_G00 of FIG. ID4_G is an example implementation of capturing marker values (Block ID4_F50) using the example marker capturer circuitry ID4_B40 and its associated ISA manager circuitry ID4_C30. As shown in the example of FIG. ID4_G, the example host VMM ID4_C10 performs a write (ID4_G10) to a model-specific register (MSR) associated with the ISA of the example processing platform ID4_A20. The example microcode update manager ID4_C18 communicates with the example ISA decoder ID4_C38 to verify (ID4_G20) the authenticity of the writing of ISA information by the VMM ID4_C10 to the MSR. The example microcode update manager ID4_C18 also communicates with the example ISA decoder ID4_C38 to decode (ID4_G30) the ISA information associated with the write. The example ISA decoder ID4_C38 verifies (ID4_G40) an ISA configuration for a current session involving bits of a message passing interface (MPI) with the example ISA evaluator ID4_C36. The example ISA evaluator ID4_C36 also configures (ID4_G50) MPI bits of the ISA to allow execution of a mile marker, emulation, or to generate an exception based on provisioned policies for the session. As such, the mile marker can execute or be emulated/simulated to allow capture of data related to the mile marker, or an exception is generated if such execution/emulation is against provisioned policies for the platform ID4_A20 or its current session, for example. The ISA evaluator ID4_C36 applies (ID4_G60) the ISA configuration to the ISA decoder ID4_C38 for the current session, and the ISA decoder ID4_C38 triggers an exit (ID4_G70) from configuration by the example ISA manager ID4_C30. The example microcode update manager ID4_C18 then returns control (ID4_G80) to the host VMM ID4_C14.
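As a purely illustrative sketch, the three policy outcomes described above (execute the mile marker, emulate it, or generate an exception) can be modeled in software as shown below. The names MarkerPolicy, MarkerPolicyError, and dispatch_marker are hypothetical, and the sketch assumes the session policy has already been decoded into a single value; it does not model the MSR write, the authenticity check, or the MPI bit configuration.

```python
from enum import Enum


class MarkerPolicy(Enum):
    """Hypothetical per-session policy for a mile marker."""
    EXECUTE = "execute"
    EMULATE = "emulate"
    DENY = "deny"


class MarkerPolicyError(Exception):
    """Raised when the session policy forbids both execution and emulation."""


def dispatch_marker(policy, execute_fn, emulate_fn):
    """Run, emulate, or reject a marker according to the provisioned policy."""
    if policy is MarkerPolicy.EXECUTE:
        return execute_fn()
    if policy is MarkerPolicy.EMULATE:
        return emulate_fn()
    raise MarkerPolicyError("marker not permitted by the session policy")


# Usage: the callables are placeholders for real and simulated marker capture.
real = dispatch_marker(MarkerPolicy.EXECUTE, lambda: "executed", lambda: "emulated")
simulated = dispatch_marker(MarkerPolicy.EMULATE, lambda: "executed", lambda: "emulated")
```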
Thus, certain examples enable a neural architecture search and training of a resulting neural network (and/or other AI network) to be hardware-aware and automatically capture and identify features of the hardware (and software) to optimize the resulting network model according to the underlying hardware (and/or software). The architecture search can examine a particular architecture configuration, taking into account latency, efficiency, and/or other configuration requirements/statistics in the measured code blocks to enable AutoML to train a network that satisfies these constraints.
In certain examples, blocks represent layers of a neural network (e.g., an ML network, etc.) and/or operations within a layer, depending on a desired level of granularity. The blocks and associated markers can be agnostic to a chosen architecture search strategy. Captured information is leveraged from the database to make an architecture search hardware-aware of the underlying platform(s) on which a resulting neural network can be run. Information can be stored in the example database for one or more hardware configurations at one or more levels of granularity to be applied, exported, shared, and/or otherwise used for network architecture identification and network training, for example. In certain examples, the resulting database can be packaged and deployed for use in a variety of AI network model training scenarios.
Mile markers and/or other indicators are tracers in underlying hardware, microcode, and/or pipeline execution to understand pipeline and associated hardware telemetry and store associated information in the database. In certain examples, operation codes (opcodes) and/or other machine instructions can be exposed to allow third parties to collect telemetry information and develop network models without input from the platform provider. For example, the example systems and methods described herein can enable an application programming interface (API) to allow third parties to define a search space, access values in the database, capture marker information to add to the database, etc.
From the foregoing, it will be appreciated that example systems, methods, apparatus, and articles of manufacture have been disclosed that provide automated featurization of a hardware platform and associated software to enable hardware-aware search and development of AI models. The disclosed systems, methods, apparatus, and articles of manufacture improve the efficiency and effectiveness of training, testing, and deploying AI models by enabling hardware-aware model development through identification and capture of features relevant to particular platform(s), configuration(s), etc., on which the AI model is to be deployed. The disclosed systems, methods, apparatus, and articles of manufacture are accordingly directed to one or more improvement(s) in the design and manufacture of processor circuitry to monitor, capture, process, and store such features and deploy them in a database for hardware-aware AI model development.
Using the disclosed systems, methods, apparatus, and articles of manufacture, mile markers can be configurable based on one or more policies and remotely managed by an entity such as a cloud server, edge server, administrator, etc. Based on telemetry and insights from mile markers, neural architecture search strategies can be improved based on past learning, incorrect predictions, etc., as evaluated using the telemetry data, mile marker insights, etc.
The following paragraphs provide various examples of the implementations disclosed herein.
Example 38 is an apparatus including: at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus; a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations; or Application Specific Integrated Circuitry including logic gate circuitry to perform one or more third operations; the processor circuitry to at least one of perform at least one of the first operations, the second operations or the third operations to: monitor, in a first phase, a hardware platform to identify features to train an artificial intelligence model; extract information regarding the hardware platform associated with a marker related to the features occurring during the first phase; and store the information in a database to, in a second phase, enable hardware-aware training of the artificial intelligence model.
Example 39 is the apparatus of example 38, wherein the processor circuitry is to at least one of perform at least one of the first operations, the second operations or the third operations to organize a feature search space into code blocks including a first code block associated with the marker.
Example 40 is the apparatus of example 38, wherein the processor circuitry is to at least one of perform at least one of the first operations, the second operations or the third operations to execute hardware-aware training of the artificial intelligence model with the information from the database.
Example 41 is the apparatus of example 38, further including a memory to store the database, the database accessible by one or more devices to drive development of the artificial intelligence model.
Example 42 is the apparatus of example 41, wherein the database is a cloud-accessible database.
Example 43 is the apparatus of example 41, wherein the information in the database is accessible to modify behavior of an application running on the hardware platform.
Example 44 is the apparatus of example 41, wherein the database is to include information gathered from at least one of a plurality of platforms or a plurality of configurations of the hardware platform.
Example 45 is the apparatus of example 41, wherein the marker is simulated rather than executed.
Example 46 includes at least one non-transitory computer readable storage medium including instructions that, when executed, cause at least one processor to at least: monitor, in a first phase, a hardware platform to identify features to train an artificial intelligence model; extract information regarding the hardware platform associated with a marker related to the features occurring during the first phase; and store the information in a database to, in a second phase, enable hardware-aware training of the artificial intelligence model.
Example 47 is the at least one computer readable storage medium of example 46, wherein the at least one processor is to organize a feature search space into code blocks including a first code block associated with the marker.
Example 48 is the at least one computer readable storage medium of example 46, wherein the at least one processor is to train the artificial intelligence model using the information in the database.
Example 49 is the at least one computer readable storage medium of example 46, wherein the at least one processor is to modify behavior of an application running on the hardware platform using the information in the database.
Example 50 is the at least one computer readable storage medium of example 46, wherein the at least one processor is to simulate execution of the marker to extract the information.
Example 51 includes a method including: monitoring, in a first phase, a hardware platform to identify features to train an artificial intelligence model; extracting information regarding the hardware platform associated with a marker related to the features occurring during the first phase; and storing the information in a database to enable hardware-aware training of the artificial intelligence model in a second phase.
Example 52 is the method of example 51, further including organizing a feature search space into code blocks including a first code block associated with the marker.
Example 53 is the method of example 51, further including training the artificial intelligence model using the information in the database.
Example 54 is the method of example 51, further including modifying behavior of an application running on the hardware platform using the information in the database.
Example 55 is the method of example 51, further including simulating execution of the marker to extract the information.
Example 56 includes a system including: marker capturing means to extract information regarding the hardware platform associated with a marker occurring during a first phase, the marker capturer to store the information in a database; and an architecture searching means to, in the first phase, monitor a hardware platform to identify features to train an artificial intelligence model and, in a second phase, execute hardware-aware training of the artificial intelligence model using the information from the database.
Example 57 is an apparatus including: at least one memory; instructions in the apparatus; and processor circuitry including control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more operations on the data, and one or more registers to store a result of one or more of the operations, the processor circuitry to execute the instructions to implement at least: a marker capturer to extract information regarding the hardware platform associated with a marker occurring during a first phase, the marker capturer to store the information in a database; and an architecture searcher to, in the first phase, monitor a hardware platform to identify features associated with the marker to train an artificial intelligence model and, in a second phase, execute hardware-aware training of the artificial intelligence model using the information from the database.
Example 58 is the apparatus of example 57, wherein the processor is to implement a block analyzer to organize a feature search space into code blocks including a first code block associated with the marker.
Example 59 is the apparatus of example 57, wherein the processor is to implement a network trainer to train the artificial intelligence model with the architecture searcher.
Example 60 is the apparatus of example 57, further including the database, the database accessible by one or more devices to drive development of the artificial intelligence model.
Example 61 is the apparatus of example 60, wherein the database is a cloud-accessible database.
Example 62 is the apparatus of example 57, wherein the information in the database is accessible to modify behavior of an application running on the hardware platform.
Example 63 is the apparatus of example 57, wherein the database is to include information gathered from at least one of a plurality of platforms or a plurality of configurations of the hardware platform.
Example 64 is the apparatus of example 57, wherein the marker is simulated rather than executed.
Example 65 is the apparatus of example 57, wherein the marker capturer includes: a telemetry manager to capture the information related to the marker; a microoperation surplus mapper to generate the marker; an instruction set architecture evaluator to dispatch the marker and evaluate at least one of a hardware state or a software state included in the information associated with the marker; and an instruction set architecture decoder to process the information for storage in the information in the database.
Example 66 is the apparatus of example 57, wherein the processor is in communication with a cloud server via a network.
Example 67 is the apparatus of any of examples 38-45, wherein extracting information regarding the hardware platform associated with a marker occurring during the first phase includes: capturing the information related to the marker; generating the marker; dispatching the marker and evaluating at least one of a hardware state or a software state included in the information associated with the marker; and processing the information for storage in the information in the database.
Example 68 is the computer-readable storage medium of any of examples 46-50, wherein the instructions, when executed, cause the at least one processor to extract the information regarding the hardware platform associated with the marker occurring during the first phase by at least: capturing the information related to the marker; generating the marker; dispatching the marker and evaluating at least one of a hardware state or a software state included in the information associated with the marker; and processing the information for storage in the information in the database.
Example 69 is the method of any of examples 51-55, wherein extracting information regarding the hardware platform associated with a marker occurring during the first phase includes: capturing the information related to the marker; generating the marker; dispatching the marker and evaluating at least one of a hardware state or a software state included in the information associated with the marker; and processing the information for storage in the information in the database.
Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent. Although the examples disclosed herein have been shown in examples related to semiconductors and/or microprocessors, the examples disclosed herein may be applied to any other appropriate interconnect (e.g., a layered interconnect) application(s) or etching processes in general.
Efficient artificial intelligence (AI) and deep learning (DL) accelerators feature high-throughput mixed-precision and sparse compute capability. Model compression is a technique to adapt a neural network (NN) to take advantage of these features.
Two example classes/techniques of model compression, also called model optimization, include (1) Quantization and (2) Pruning. Both methods are predominant techniques of “compression” for the NN. Quantization is the process of converting a neural network execution graph that is originally operating in a first precision (e.g., floating point (FP32/16)) to operate in the target hardware (HW) at a second precision (e.g., Int8/4/2/1). For example, the sum and product operations of each layer operate on the prescribed units of the HW to provide a layer-wise precision prescription solution (e.g., first precision, second precision). The layer-wise precision prescription solution enables deployment of a neural network in low precision and exploits the lower power consumption and higher throughput of low-precision arithmetic units. Pruning is the process of sparsifying (introducing zeros into) neural network parameters, resulting in computation and memory capacity/bandwidth savings. For example, connections (e.g., weights) and neurons (intermediate inputs/features) are pruned at every layer to provide a layer-wise pruning rate. These model compression techniques improve model efficiency by reducing the memory required to store the model. For example, by pruning particular neurons the corresponding weights do not need to be stored, thus enabling storage savings for the model. Additionally, during runtime these particular weights and intermediate inputs do not need to be fetched from storage for execution. Furthermore, quantizing to a relatively lower precision also lowers the amount of space necessary for model storage. Model compression can be achieved using one technique or a combination of techniques. Other model compression techniques not mentioned above can also be utilized in examples disclosed herein.
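By way of a non-limiting illustration, layer-wise quantization and layer-wise pruning can be sketched as follows. The sketch assumes symmetric quantization with a single scale per layer and magnitude-based pruning, which are common choices but are not prescribed by the examples disclosed herein. The function names, the flat-list weight layout, and the policy format are hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers of the given bit width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, rate):
    """Zero out the fraction `rate` of weights that have the smallest magnitude."""
    count = int(len(weights) * rate)
    if count == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:count])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def compress_layerwise(model, policy):
    """Apply a per-layer pruning rate, then a per-layer bit width."""
    compressed = {}
    for name, weights in model.items():
        pruned = prune_by_magnitude(weights, policy[name]["prune_rate"])
        quantized, scale = quantize_symmetric(pruned, policy[name]["bits"])
        compressed[name] = {"q": quantized, "scale": scale}
    return compressed


# Usage: a hypothetical two-layer model with a different policy for each layer.
model = {"conv1": [0.5, -1.0, 0.25, 0.1], "fc": [0.02, -0.4, 0.3, 0.0]}
policy = {
    "conv1": {"bits": 8, "prune_rate": 0.5},
    "fc": {"bits": 4, "prune_rate": 0.25},
}
compressed = compress_layerwise(model, policy)
```

Each layer stores only small integers and one scale, and the pruned positions are zero, which reflects the storage and bandwidth savings described above. The accuracy cost of a given policy is not modeled in this sketch.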
However, in some examples, model compression causes accuracy degradation. To achieve improved performance and accuracy on target hardware, selection of fine-grained (layer-wise) precision and/or sparsity is required to occur in a manner consistent with specific parameters of that target hardware. Fine grained/layer-wise/per-layer compression precision includes layer-wise quantization and/or layer-wise pruning. Layer-wise quantization (e.g., (1) quantization) varies operating bit width (e.g., precision) at every layer. Layer-wise pruning (e.g., (2) pruning) varies sparsity levels (e.g., the number of zeros) at every layer. Appropriate selection of fine-grained (layer wise) precision and/or sparsity can vary based on network topology, available hardware, and/or performance targets/expectations. Examples disclosed herein improve model accuracy by, in part, automating the search of fine-grained design and optimizing performance and accuracy on target HW.
Existing techniques to compress models are unscalable and inefficient for large-scale model adaptation. The current techniques (e.g., an optimizer/learning agent) are repeated independently from scratch for every instance of a model (e.g., a neural network), its target platform, and the corresponding performance goals of the model. As a result, for every new workload having one or more models to be optimized (e.g., compressed), optimization resources (e.g., exploration agents) are spawned from a ground state that lacks anything learned from previously spawned agents. Efficient example solutions to support scaling are disclosed herein. Examples disclosed herein create a scaling technique to support multiple XPU platforms (e.g., a mix of architectures, collectively described as XPU, that includes CPU, GPU, VPU, FPGA, etc.) where target deployment can be any combination of XPU platforms (heterogeneous inference). Examples disclosed herein include an adaptable system that supports different neural network topologies, datasets from different customers (e.g., tenants), and different target performance-accuracy trade-offs to support scaling and create a model (e.g., neural network) with improved efficiency.
Examples disclosed herein apply transferable and reusable agents without the need of optimizer/agent learning from scratch. The agents of examples disclosed herein are executable processes that learn internal representations of the relationship between policy/actions and the type of task, topology and/or target HW. The agents of examples disclosed herein prescribe the pruning rate and quantization rate for target hardware. Examples disclosed herein facilitate a manner of automation of a new set of HW targets, a new network architecture, and corresponding performance goals that converge comparatively faster than existing techniques.
FIG. ID6_1A is an example framework of structure and machine-readable instructions which may be executed to implement a search flow for model compression. FIG. ID6_1A shows the framework to apply transferable and re-usable agents without the need of optimizer/agent learning from scratch. In operation, an example input receiver circuitry ID6_102 is an example structure that receives, retrieves and/or otherwise obtains inputs (e.g., from a user, manual inputs, inputs from storage, etc.). The example input receiver circuitry ID6_102 includes example automation HW target receiver circuitry ID6_104, example training dataset receiver circuitry ID6_106, and example trained network definition receiver circuitry ID6_108.
Inputs include but are not limited to those from the example automation HW target receiver circuitry ID6_104. The example automation HW target receiver circuitry ID6_104 is an example structure that receives, retrieves and/or otherwise obtains the type of input target HW (e.g., VPU (int8/4/2/1), FPGA, CPU+GPU) and the required performance target(s). In some examples, target HW information is retrieved from user input, storage devices (e.g., databases) containing platform configuration data, and/or assets that seek information from hardware that is communicatively connected via one or more networks.
Inputs also include but are not limited to those from the example training dataset receiver circuitry ID6_106, which is an example structure that receives, retrieves and/or otherwise obtains training dataset(s) (e.g., from the user, from one or more training dataset storage devices, etc.). The training dataset(s) can be the full original training dataset, or a subset thereof, that is utilized to achieve the trained network of ID6_108. The training dataset(s) are used to recover the accuracy degradation caused by compression policies (e.g., quantization, pruning).
Inputs also include but are not limited to those from the example trained network definition receiver circuitry ID6_108. The trained network definition receiver circuitry ID6_108 is an example structure that receives, retrieves and/or otherwise obtains trained network(s) from the user. Trained networks are needed as they are the target of optimization, in which they are quantized or pruned. The trained network, when optimized (e.g., using quantization or pruning) can perform a new task (e.g., object detection, recommendation system) on a different data set (e.g., different images, different user inputs). Trained networks provide baseline of task performance (e.g., accuracy, latency), serving as the reference for observing any degradation due to compression. Instead of repeating the network training and starting with randomized initialized weights, the previous weights from the pretrained network can be saved and applied as initial weight values for the new task. Some advantages to using a pre-trained network include saving time, power, and resources because the pre-trained network gives the new task a quicker starting point.
An example agent ID6_110, also known as a learner or optimizer, is spawned or otherwise invoked to learn and optimize a scalable model compression for improved performance in view of particular platform characteristics. In the illustrated example of FIG. ID6_1A the example agent ID6_110 is an optimization algorithm and/or hardware that is implemented and/or otherwise represented in one or more neural networks. It can be, but is not limited to, deep reinforcement learning, Bayesian optimization, and/or evolutionary algorithms. In the examples disclosed herein, the example agent is directed to deep reinforcement learning algorithms. Examples disclosed herein can be located at and/or otherwise perform optimization tasks at any location of the example Edge cloud, such as the example Edge cloud A110 of
The example experience replay buffer circuitry ID6_112 is an example structure that contains data from previous iterations. The example experience replay buffer circuitry ID6_112 contains a historical policy, a reward, feedback from a compression environment, and/or evaluated hardware that is saved to substantiate the training of the example agent ID6_110 by speeding up and improving the training. Generally speaking, the experience replay buffer circuitry ID6_112 contains feedback and provides a dataset or experience for the agent ID6_110 to use to train (e.g., to learn the best compression policy of each neural network layer) any input neural network.
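As a non-limiting illustration, an experience replay buffer of the kind described above may be sketched as follows. The class names, field names, and the uniform random sampling strategy are hypothetical choices made for this sketch and are not taken from the examples disclosed herein.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    """One record from a previous iteration (field names are illustrative)."""
    policy: tuple       # e.g., per-layer bit widths that were tried
    reward: float       # scalar reward assigned to that policy
    accuracy: float     # feedback from the compression environment
    latency_ms: float   # feedback from the hardware evaluation


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest records are evicted first."""

    def __init__(self, capacity: int):
        self._items = deque(maxlen=capacity)

    def add(self, experience: Experience) -> None:
        self._items.append(experience)

    def sample(self, batch_size: int, rng: random.Random) -> list:
        # Uniform sampling without replacement; returns fewer records
        # when the buffer holds less than batch_size entries.
        k = min(batch_size, len(self._items))
        return rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)
```

In such a sketch, an agent would call `add` after each evaluated policy and call `sample` to obtain a training batch drawn from earlier iterations.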
Example layer-wise mixed-precision sparsity policy predictor circuitry ID6_113 is an example structure that takes the prediction from the agent ID6_110 and checks for changes in delta (e.g., monitoring for diminishing or increasing returns) to decide when to stop iterations when performance is no longer improving. The example agent ID6_110 is responsible for, in part, predicting/inferencing a layer-wise mixed-precision configuration that is consumed by an example compression environment performer circuitry ID6_116. Layer-wise mixed-precision configuration is a technique used to find the optimal configuration for every layer of a trained neural network so inference is accelerated, and accuracy is maintained. The training of the neural network happens in full precision and hence the trained neural network is required to be input via trained network definition receiver circuitry ID6_108. Training involves training an agent. During each iteration, the layer-wise mixed-precision sparsity policy predictor circuitry ID6_113 explores a potential solution and by the end of the iteration, the layer-wise mixed-precision sparsity policy predictor circuitry ID6_113 is converged to an optimal solution. The agent is the predictor/inferencer that is reused during training. The output of the trained agent is a layer-wise or mixed-precision and/or sparsity policy.
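As a non-limiting illustration, the check for changes in delta that decides when to stop iterating may be sketched as follows. The function name and the `min_delta` and `patience` parameters are assumptions introduced for this sketch; the examples disclosed herein do not prescribe a particular stopping rule.

```python
def should_stop(rewards, min_delta=1e-3, patience=3):
    """Return True when the best reward seen in the last `patience`
    iterations exceeds the best earlier reward by no more than `min_delta`.

    `rewards` is the list of per-iteration rewards, oldest first.
    """
    if len(rewards) <= patience:
        return False  # not enough history to judge diminishing returns
    best_before = max(rewards[:-patience])
    best_recent = max(rewards[-patience:])
    return best_recent - best_before <= min_delta
```

A larger `patience` tolerates longer plateaus before stopping, at the cost of additional iterations.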
The example compression environment performer circuitry ID6_116 is an example structure that performs compression and evaluates post-compression metrics such as accuracy and distortion. The example compression environment performer circuitry ID6_116 receives a neural network fine-grained compression framework (e.g., realized as a policy and/or a particular configuration) from the layer-wise mixed-precision sparsity policy predictor circuitry ID6_113 and from the agent ID6_110. Additionally, the compression environment performer circuitry ID6_116 creates a hardware specific execution graph for hardware performance evaluation. The hardware specific execution graph provides feedback on latency, throughput, and/or power. Furthermore, the example compression environment performer circuitry ID6_116 provides feedback on post-compression accuracy and network dynamic observations to the agent ID6_110. The example compression environment performer circuitry ID6_116 is also communicatively connected to an example HW executable ID6_118 (e.g., target hardware or a software simulator) to perform compression and evaluate post-compression metrics.
An example accuracy/network states checker circuitry ID6_114 is an example structure that checks for the accuracy and state of the neural network. The accuracy/network states checker circuitry ID6_114 compares a value (e.g., input by user, a threshold value and/or a predetermined value) to the accuracy and state of the neural network to determine if it has reached a predefined threshold (e.g., determined by user input and/or a predetermined value). The accuracy/network states checker circuitry ID6_114 serves as samples for the agent ID6_110 to learn from. The result of the comparison can determine if a change needs to occur (e.g., adjust weights) or to use/release the resulting value/model. The output of the accuracy/network states checker circuitry ID6_114 is stored in the experience replay buffer circuitry ID6_112.
The example HW executable ID6_118 is an example structure that takes the results from the compression environment performer circuitry ID6_116 and sends the results from the hardware specific execution graph to an example hardware performance evaluator circuitry ID6_124 for hardware performance evaluation.
The example hardware performance evaluator circuitry ID6_124 is an example structure that evaluates the performance of a predicted policy of the example agent ID6_110. The example hardware performance evaluator circuitry ID6_124 can be a simulation model (e.g., statistical) or it can be a profiling application that deploys the hardware-mapped graph on the real target hardware. The example hardware performance evaluator circuitry ID6_124 also sends the performance feedback and hardware specific dynamic observations to the agent ID6_110.
An example hardware results sender circuitry ID6_120 is an example structure that receives evaluations from the example hardware performance evaluator circuitry ID6_124 and sends hardware metrics such as latency, throughput, and power to the agent ID6_110.
Example network outputer circuitry ID6_122 is an example structure that outputs the sparse and/or mixed-precision network deployable model on target hardware. The example network outputer circuitry ID6_122 outputs a compressed network with layer-wise configurations and a compressed network that is specific to target hardware. The resulting example optimal compression from the example network outputer circuitry ID6_122 can be employed in the application running on targeted hardware. The example network outputer circuitry ID6_122 sends an output that causes the compression environment performer circuitry ID6_116 to achieve compression goals with minimal accuracy impact.
FIG. ID6_1B illustrates example compression techniques. A fully trained NN ID6_126 is illustrated on the left side with a fully connected network of nodes. In the illustrated example of FIG. ID6_1B, nodes are represented as circles and weights are represented as lines between respective circles. The fully trained NN ID6_126 undergoes a fine-grained/layer-wise/per-layer compression policy, which can include layer-wise pruning ID6_128 and/or layer-wise quantization ID6_130. Example layer-wise pruning ID6_128 includes varying a sparsity level (e.g., number of zeros) at different layers. Example layer-wise quantization ID6_130 includes varying operating bit width (e.g., precision) at different layers.
Pruning ID6_128 is a compression technique that allows models to begin with a large NN (e.g., an NN with more layers) and remove weights and/or nodes to reduce the size of the trained model, making it easier (e.g., relatively less bandwidth required) to distribute while minimizing loss in accuracy and performance. Pruning methods include layer-wise pruning, which allows the connections (e.g., weights) and neurons (e.g., intermediate input/features) to be pruned at every layer to increase accuracy and performance and decrease size and time to output results. An example of pruning weights can include setting individual parameters to zero and making the network sparse. This would lower the number of parameters in the model while keeping the architecture the same. An example of pruning nodes can include removing entire nodes from the network. This would make the NN architecture itself smaller, while aiming to keep the accuracy of the initial larger network.
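As a non-limiting illustration, the two pruning styles described above may be sketched as follows for a single layer whose weights are stored as a list of rows (one row per output node). Magnitude-based selection and L1-norm ranking are assumptions made for this sketch; the function names are hypothetical.

```python
def prune_weights(weights, sparsity):
    """Weight pruning: zero the smallest-magnitude fraction `sparsity`
    of the weights. The layer keeps its shape but becomes sparse."""
    positions = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    n_prune = int(len(positions) * sparsity)
    pruned = [list(row) for row in weights]
    for _, r, c in positions[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


def prune_nodes(weights, keep):
    """Node pruning: keep only the `keep` rows (output nodes) with the
    largest L1 norm, which makes the layer itself smaller."""
    ranked = sorted(
        range(len(weights)),
        key=lambda r: sum(abs(w) for w in weights[r]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original node order
    return [list(weights[r]) for r in kept]


layer = [[0.9, -0.05, 0.4], [0.01, 0.02, -0.03], [-0.7, 0.3, 0.08]]
sparse_layer = prune_weights(layer, sparsity=0.5)
smaller_layer = prune_nodes(layer, keep=2)
```

A layer-wise policy would call such functions with a different `sparsity` or `keep` value for each layer.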
Quantization ID6_130 is a compression technique that converts full precision to a lower precision. Quantization creates noise, wherein noise refers to the distortion caused by rounding due to the limited resolution of a given precision. For example, if 2 bits are used to represent values ranging between 0 and 1, then the simplest way to represent that range is to use the binary codes 00, 01, 10, and 11 to evenly represent the range (e.g., 0.25, 0.5, 0.75, and 1, respectively). Any given value between the intervals will be rounded to the closest value. For example, 0.89 will be rounded to 1 and be represented as 11, so the value is distorted by 0.11. As described, quantization creates noise, distorts the original neural network, and results in accuracy degradation. To circumvent the problem, neural network layers are quantized in a non-uniform way because some layers are more sensitive to distortion.
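As a non-limiting illustration, the 2-bit example above may be sketched as follows. The function name and its parameters are hypothetical. The sketch assumes evenly spaced levels ending at the top of the range (0.25, 0.5, 0.75, 1 for 2 bits), matching the example above, and relies on Python's built-in rounding for values that fall between levels.

```python
def quantize(value, bits=2, low=0.0, high=1.0):
    """Round `value` to the nearest of 2**bits evenly spaced levels.

    Returns (binary code, reconstructed value). Levels are
    low + step, low + 2*step, ..., high.
    """
    levels = 2 ** bits
    step = (high - low) / levels
    index = round((value - low) / step) - 1
    index = max(0, min(levels - 1, index))  # clamp to a valid code
    code = format(index, f"0{bits}b")
    return code, low + (index + 1) * step


code, restored = quantize(0.89)
distortion = abs(0.89 - restored)  # rounding noise introduced by 2-bit precision
```

Increasing `bits` shrinks `step` and therefore the worst-case distortion, which is the trade-off a layer-wise mixed-precision policy adjusts per layer.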
Both pruning and quantization are predominant “compression” techniques for NNs. The compression techniques may result in HW-agnostic and HW-dependent model compression. HW-agnostic model compression results include model storage size and/or runtime memory footprint/bandwidth. HW-agnostic model compression is not dependent on custom HW, but some compression techniques provide HW-dependent results that require one or more particular architectures to execute pruned and/or quantized models having high throughput and/or low latency. HW-dependent model compression improves computation for specific HW in a manner that depends on the structure of the HW.
FIG. ID6_1C is another example of a compression framework. This is only one example of how one or more agents can make predictions; there are many other methods to accomplish this task and other tasks. Agent ID6_131 predicts pruning/quantization for one, multiple, or all layers at once. The agent ID6_131 may be implemented in a manner consistent with the example agent ID6_110 of FIG. ID6_1A. This example compression technique shows the example agent ID6_131 traversing each layer {t0, . . . , tn} in a target model ID6_138. At each layer, quantizers ID6_140, ID6_142, ID6_144 quantize the target model ID6_138 and make a prediction given the state of the layer. An example agent controller ID6_132 (e.g., deep deterministic policy gradient (DDPG), a type of deep reinforcement learning algorithm) gathers and stores the samples from the target model ID6_138 in addition to samples from an example experience buffer ID6_134. The experience buffer ID6_134 stores data that is independently distributed to break any temporal correlation of behavior and distributes/averages data over many of its previous states to avoid oscillations or divergence in the model. The agent controller ID6_132 samples the experiences from the experience replay buffer ID6_134 and uses the samples to train the NN. The feedback from the quantization environment ID6_148, which includes accuracy, model size, and layer distortion, and the feedback from the hardware evaluator ID6_150, which includes latency, throughput, and/or power, are assimilated to provide a metric of “goodness” (e.g., resulting in an optimal solution) of a mixed precision ID6_147. These rewards, together with other attributes, are saved to the experience buffer. The reward function ID6_136 gives rewards or more value to results (e.g., accuracy, model size, layer, latency, throughput, power) that exhibit higher accuracy, accurate model size, lower power consumption, higher performance, etc., and/or that reach specific targets set by the user and/or pre-set in the storage.
The agent ID6_131 retrieves and/or otherwise obtains samples of the quantized data ID6_147 from the target model ID6_138. The agent ID6_131 uses samples of the quantized data ID6_147 from the buffer to alter (e.g., change, improve) its network. In some examples, alterations are applied with different magnitudes depending on, for instance, a number of iterations or epochs a process or model has performed. Early instances of iterations, such as when there is no data available, will involve relatively greater magnitude changes to parameter adjustments. However, as iterations increase, the results of a modeling effort (e.g., a gain/loss function) may produce values that are relatively closer to ground truth values, or results during subsequent iterations may be relatively small. In such circumstances, the alterations may be relatively smaller and/or otherwise proportional to the amount of change in a calculated value from one iteration to the next. The agent ID6_131 then traverses through the quantizer ID8_138 to infer the precision by executing its network. The quantized data ID6_147 result from mapping the input values from a large set (e.g., a continuous set of data) to output values in a smaller set, often with a finite number of elements. This can be implemented using rounding and/or truncation or any other method. The result of the quantization environment ID6_148 can be sent back to the agent ID6_131 where it is then evaluated for accuracy, model size, and layer distortion ID6_152. The result of the quantization environment ID6_148 can also be sent to a hardware evaluator ID6_150 if it is to be implemented on specific hardware. The resulting data from the target model ID6_138 can be implemented for use in hardware (e.g., VPU (int8/4/2/1), FPGA, CPU+GPU).
The hardware is then evaluated for latency, throughput, and power ID6_154, and the results are sent back to the agent ID6_131 where they can be applied to the reward function ID6_136. This example compression technique can be applied to different neural networks for different tasks (e.g., object detection, text-to-speech synthesis, recommendation system) and different hardware (e.g., VPU (int8/4/2/1), FPGA, CPU+GPU).
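As a non-limiting illustration, a reward function that assimilates compression-environment feedback and hardware feedback into a single goodness metric may be sketched as follows. The weighted-sum form, the weight values, the accuracy floor, and all names are assumptions made for this sketch; the examples disclosed herein do not specify a particular formula.

```python
def reward(accuracy, size_ratio, latency_ratio,
           accuracy_floor=0.0, w_acc=1.0, w_size=0.5, w_lat=0.5):
    """Combine feedback into one scalar (higher is better).

    accuracy      : post-compression accuracy in [0, 1]
    size_ratio    : compressed model size / original size (lower is better)
    latency_ratio : compressed latency / original latency (lower is better)
    accuracy_floor: hypothetical user-set target; policies below it are penalized
    """
    if accuracy < accuracy_floor:
        return -1.0  # target missed: discourage this policy outright
    return (w_acc * accuracy
            + w_size * (1.0 - size_ratio)
            + w_lat * (1.0 - latency_ratio))


compressed = reward(accuracy=0.9, size_ratio=0.5, latency_ratio=0.5)
uncompressed = reward(accuracy=0.9, size_ratio=1.0, latency_ratio=1.0)
```

Under these assumed weights, a policy that halves size and latency at equal accuracy scores higher than leaving the model uncompressed, while a policy that misses the accuracy target receives a fixed penalty.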
FIG. ID6_2 is an example of a framework ID6_200 including structure and machine-readable instructions which may be executed to implement a search flow for a model compression method for three example customers with different requirements. The example structures of FIG. ID6_2 correspond to examples disclosed in FIG. ID6_1A. However, such examples are adapted for each of customer A ID6_302, customer B ID6_304, and customer C ID6_306 (see FIG. ID6_3, discussed in further detail below), which correspond to agent A ID6_202, agent B ID6_206, and agent C ID6_210, respectively.
An example agent A ID6_202 is an example of the agent ID6_110 disclosed above that is associated with and/or otherwise services the requirements of the customer A ID6_302. The requirements of the agent A ID6_202 include an object detection workload, a VPU (Int8/4/2/1) target hardware, and a 2× latency improvement with ±1% accuracy goal. An example VPU (Int8/4/2/1, Sparse Compute) ID6_204 is an example structure for the example hardware performance evaluator circuitry ID6_124 for the customer A ID6_302.
An example agent B ID6_206 is an example of the agent ID6_110 disclosed above that is associated with and/or otherwise services the requirements of the customer B ID6_304. The requirements of the example agent B ID6_206 include a text-to-speech synthesis workload, an FPGA target hardware, and a 50% of original model size, +3% accuracy impact. An example FPGA (Int X) ID6_208 is an example structure for the hardware performance evaluator circuitry ID6_124 for the customer B ID6_304.
An example agent C ID6_210 is an example of the agent ID6_110 disclosed above that is associated with and/or otherwise services the requirements of the customer C ID6_306. The requirements of the agent C ID6_210 include a recommendation system workload, a CPU+GPU target hardware, and a 30% sparse Int8 embedding, +1% accuracy impact. An example CPU (FP32, BF16, INT X) ID6_212 is an example structure for the hardware performance evaluator circuitry ID6_124 for the customer C ID6_306.
An example GPU (FP16, sparse compute) ID6_214 is an example structure for the hardware performance evaluator circuitry ID6_124 for the customer C ID6_306. Generally speaking, the illustrated example of FIG. ID6_2 illustrates that traditional agent techniques require initialization of an agent, processing involving agent learning/training, and complete re-initialization and learning of an agent in the event a new customer requests evaluation tasks. In contrast, examples disclosed herein retain the benefit of agent training when moving to one or more additional or alternative workload or target network training tasks, thereby saving computational resources and time to find the optimal configuration.
FIG. ID6_3 is an example of three customers with different requirements. An example customer A ID6_302 is an example of a customer with the requirements corresponding to the agent A ID6_202 that include an object detection workload, a VPU (Int8/4/2/1) target hardware, and a 2× latency improvement with ±1% accuracy goal. An example customer B ID6_304 is an example of a customer with the requirements corresponding to the agent B ID6_206 that include a text-to-speech synthesis workload, an FPGA target hardware, and a 50% of original model size, +3% accuracy impact. An example customer C ID6_306 is an example of a customer with the requirements corresponding to the agent C ID6_210 that include a recommendation system workload, a CPU+GPU target hardware, and a 30% sparse Int8 embedding, +1% accuracy impact.
FIG. ID6_4 is an example of the productivity improvement after using scalable model compression techniques disclosed herein. The illustrated example of FIG. ID6_4 includes a first example A ID6_402 to illustrate the total time it would take to train three agents using a regular neural network method that does not involve a scalable model compression method for optimal platform specialization. The three learning agents include a first learning agent ID6_404, a second learning agent ID6_406, and a third learning agent ID6_408, all of which represent three different example customers who require models to be trained. The example three learning agents (ID6_404, ID6_406 and ID6_408) are all individually spawned from scratch. Stated differently, the learning agents of example A begin their execution tasks with default properties (e.g., settings, values, parameters, etc.). In this case, a total search time ID6_410 is a function of a sum of the execution of each learning agent. In other words, it can take the same amount of time or more time to train one or more models due to the serial nature of the learning agent training process and the fact that traditional techniques do not realize the benefit of prior training efforts for subsequent agent configuration.
The illustrated example of FIG. ID6_4 includes a second example B ID6_412 to illustrate the total time it would take to train seven models (e.g., each model having at least one corresponding agent) using examples disclosed herein. That is, example B ID6_412 illustrates scalable model compression methods for platform specialization. The example seven reusable agents include a first reusable agent ID6_416, a second reusable agent ID6_418, a third reusable agent ID6_420, a fourth reusable agent ID6_422, a fifth reusable agent ID6_424, a sixth reusable agent ID6_426, and a seventh reusable agent ID6_428. These example seven reusable agents correspond to seven different customers that require models to be trained. In a first iteration for the example first reusable agent ID6_416, the total search time value ID6_430 takes the longest amount of relative time to train the model. In the second iteration for the example second reusable agent ID6_418, the total search time value ID6_430 takes less time to train the model than it took for the example first reusable agent ID6_416. For the following reusable agents (the example third reusable agent ID6_420, the example fourth reusable agent ID6_422, the example fifth reusable agent ID6_424, the example sixth reusable agent ID6_426, and the example seventh reusable agent ID6_428), it will take less time to train the models than it took for the example first reusable agent ID6_416 and the example second reusable agent ID6_418. In particular, the aforementioned improvement is enabled and/or otherwise caused by the fact that each subsequent agent that is invoked and/or otherwise assigned a task (e.g., a modeling task) does not start from a ground zero configuration state.
Comparing the first example A ID6_402 and the second example B ID6_412 there is a distinct advantage with Example B ID6_412 where the scale of delivery of compressed models for a given turn around time/lead shows a temporal improvement for finding optimal compression policies for large-scale customers (or custom models) and hardware platforms. As described above, instead of learning from scratch at every instance, examples disclosed herein enable the learning agents to become more efficient over time and take less time to train the model. The time to find an optimal policy for new tasks is faster and more scalable for various customers and their diverse requirements. Examples disclosed herein are implemented through a generalized learning architecture that supports all types of neural network operators and platform (Edge) architectures. Examples disclosed herein are implemented through transfer learning of one or more learning agents. In some examples, this is implemented through knowledge accumulation via a central experience database.
There is a distinct advantage in using a scalable model compression method for optimal platform specialization if there is a constraint where only one project can be run at any given time. Because projects are executed sequentially, efficiency and speed are improved when each project has a shorter run-time. Stated differently, each project would require fewer resources to complete. For the same fixed amount of time and computing resources, a relatively greater number of projects can be completed as compared to conventional techniques (regardless of serial or parallel implementation).
FIG. ID6_5 is an example of a generalized agent architecture ID6_500 for example scalable model compression techniques for improved platform specialization. Examples disclosed herein enable transferability of skilled agents for tasks (e.g., compression tasks (layer-wise pruning and/or quantization) that satisfy input objectives and constraints). A learned agent has the internal representation (e.g., mathematical function) of how compression decisions on neural network operators affect accuracy and performance. The internal representation is a mathematical function (e.g., formula) that maps the input variable to a common contextual measure (e.g., set of values). As such, examples disclosed herein reuse the agent to save computational exploration efforts on improbable action spaces (configurations). Examples disclosed herein optionally expose one or more knobs to facilitate selection and/or adjustment of selectable options. Knobs can include decisions for actionable compression, or the output of agent prediction. Regarding quantization, knobs can include changes in the precision target (e.g., int8, int4, int1, etc.), changes in the type of quantization (e.g., per-channel, per-tensor), and/or changes between affine and non-uniform quantization. Regarding pruning, knobs can include the type of importance method and/or the type of granularity (e.g., fine-grained, vector-level, kernel-level). Knobs may be selected by, for example, a user and/or the agent(s). Agent knob adjustment may occur in an automatic manner independent of the user in an effort to identify one or more particular optimized knob settings.
For example, an agent could learn that a fully-connected (FC) layer possesses more redundancy than a convolutional layer (e.g., more FC layers). However, the agent would also learn that the compressibility is also determined by the particular location of FC layers, as the distortion at different parts of the network could degrade accuracy differently. When mixed-precision compute is unavailable, the agent learns to be more aggressive in pruning as it is the only way to improve performance via memory savings. As such, the example agent learns the complex interplay of the variables. In some examples herein, variables can include factors that affect compressibility of the network including the type of operator (e.g., FC, convolutional layer, embedding, etc.). In some examples herein, variables can include next-level attributes of the operator including the location of the operator, the connectivity between operators, the dataset size used by the target network, and/or hardware attributes. Examples disclosed herein retain the prior knowledge to help the agent decide the probable solutions when it is reused in another (e.g., similar) network architecture and platform target.
No certain or predetermined architecture on interaction, policy, or value network is enforced because examples disclosed herein enable the realization that there are many different topologies that can efficiently complete the task. Rather, examples disclosed herein facilitate selection of the best topology based on observed iterations while the network converges on an optimal solution for the task.
In the illustrated example of FIG. ID6_5, an example model inputter ID6_502 is an example structure that receives an input model (e.g., from the user, from storage). An example target HW inputter ID6_504 is an example structure that receives target HW information (e.g., from the user). An example compressible operation embedder ID6_506 is an example structure that employs embedding layers in the agent architecture to map models under compression to latent space representations (vectors). For example, the compressible operation embedder ID6_506 maps arbitrary input network operator(s) to a fixed-length vector (a set of values) of a common dimension, which contextually draws the relativity between different inputs with respect to the output space (e.g., abstract space, latent space). This mapping function is not known or is infeasible to derive analytically, so the function is trained using machine learning procedures (e.g., deep learning).
In some examples, each compressible operation is translated to one/multi-hot encoding. These embeddings will be learned during reinforcement learning operations performed by the agent. Because the learned embeddings are one/multi-hot encoded, they can be reused (e.g., reused on new target networks) and expanded (e.g., new addition of operator internal representation). An example one-hot encoding includes representing the presence of an object (e.g., “car”) in an example five-dimensional vector like [0,1,0,0,0]. An example multi-hot encoding includes first label-encoding classes, thus having only a single number which represents the presence of a class (e.g., 1 for “car”), and then converting the numerical labels to binary vectors of size ⌈log2 5⌉=3 (e.g., “computer”=[0,0,0], “car”=[0,0,1], “phone”=[0,1,0]). One/multi-hot encoding can be reused and expanded on any set of optimization targets.
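As a non-limiting illustration, the two encodings in the example above may be sketched as follows. The function names are hypothetical, and the class ordering (“computer”=0, “car”=1, “phone”=2) is taken from the example vectors above.

```python
def one_hot(index, num_classes):
    """Vector of length num_classes with a single 1 at `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


def binary_label(index, num_classes):
    """Label-encode a class as `index`, then expand it to a binary vector.

    The width equals ceil(log2(num_classes)), computed with integer
    arithmetic to avoid floating-point rounding.
    """
    width = max(1, (num_classes - 1).bit_length())
    return [int(bit) for bit in format(index, f"0{width}b")]


classes = ["computer", "car", "phone", "tree", "road"]  # last two are made up
car = classes.index("car")
car_one_hot = one_hot(car, len(classes))
car_binary = binary_label(car, len(classes))
```

The binary form needs three values per class instead of five here, and the gap widens as the number of classes grows.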
An example operation latent representor ID6_516 is an example structure that normalizes the output of the compressible operation embedder ID6_506. At least one advantage of normalizing the output of the compressible operation embedder ID6_506 is that it creates a normalized vector that is the same size regardless of the size of the input, so that it can be re-used for further iterations and other customers.
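As a non-limiting illustration, an embedding lookup followed by normalization may be sketched as follows. The class and function names are hypothetical, L2 normalization is an assumed choice, and the randomly initialized vectors stand in for embeddings that would be learned during reinforcement learning.

```python
import math
import random


class OperationEmbedder:
    """Maps an operator type (e.g., "conv", "fc") to a fixed-length vector.

    Unseen operator types get a new table entry, so the table can be
    expanded without changing the vector length.
    """

    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}

    def embed(self, op_type):
        if op_type not in self._table:
            # Placeholder initialization; a real agent would train these values.
            self._table[op_type] = [self._rng.uniform(-1.0, 1.0)
                                    for _ in range(self.dim)]
        return self._table[op_type]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


embedder = OperationEmbedder(dim=4)
conv_latent = l2_normalize(embedder.embed("conv"))
fc_latent = l2_normalize(embedder.embed("fc"))
```

Both latent vectors have the same length and unit norm regardless of which operator produced them, which is the property that permits reuse across target networks.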
An example platform embedder ID6_508 is an example structure that employs embedding layers in the agent architecture to map platform attributes to a latent space representation (vector). Each compressible operation is a layer that can be mapped on mixed-precision/sparse HW and is translated to one/multi-hot encoding. Categorical HW attributes such as CPU, VPU, GPU, FPGA, SKU, etc. capability types are also encoded in a similar fashion. These embeddings will be learned during reinforcement learning operations by the agent. As they are one/multi-hot encoded, the learned embeddings can be reused and expanded on any target hardware.
An example HW latent representator ID6_518 is an example structure that normalizes the output of the platform embedder ID6_508. The advantage of normalizing the output of the platform embedder ID6_508 is that it creates a normalized vector that is the same size regardless of the size of the input, so that it can be re-used for further iterations and other customers.
An example static attributer ID6_510 is an example structure with real value quantities that can include (e.g., store) operator hyperparameters (e.g., convolution kernel size, stride, input feature map size) and hardware attributes (e.g., number of processing elements, cache size). The static attributer ID6_510 is a direct representation of attributes of model(s) under compression and the properties of the input target hardware, which are static during the lifetime of the search/reinforcement learning flow. The static attributer ID6_510 is a memory (e.g., storage) that includes these hyperparameters (e.g., attributes, properties).
An example dynamic observer ID6_512 is an example structure with real value quantities that indicate the states of an explored policy (e.g., compression distortion per operator, compression budget, relative accuracy). Dynamic features are usually the feedback from the compression environment and hardware evaluator.
An example normalized dense feature representor ID6_520 is an example structure that normalizes the outputs of the static attributer ID6_510 and the dynamic observer ID6_512. The advantage of normalizing the outputs of the static attributer ID6_510 and the dynamic observer ID6_512 is that it creates normalized vectors that are the same size regardless of the size of the data, so that they can be re-used for further iterations and other customers.
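As a non-limiting illustration, normalizing static attributes and dynamic observations into one fixed-length dense vector may be sketched as follows. Min-max scaling, the feature names, and the value ranges in `FEATURE_BOUNDS` are all assumptions made for this sketch.

```python
# Hypothetical (low, high) range for each real-valued feature.
FEATURE_BOUNDS = {
    "kernel_size": (1, 7),              # static: operator hyperparameter
    "stride": (1, 4),                   # static: operator hyperparameter
    "cache_kib": (0, 4096),             # static: hardware attribute
    "layer_distortion": (0.0, 1.0),     # dynamic: compression feedback
    "relative_accuracy": (0.0, 1.0),    # dynamic: compression feedback
}


def min_max(value, low, high):
    """Scale `value` into [0, 1], clamping values outside the range."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


def dense_features(static, dynamic):
    """Return one vector whose length and ordering depend only on
    FEATURE_BOUNDS, not on the network or hardware being described."""
    merged = {**static, **dynamic}
    return [min_max(merged[name], *FEATURE_BOUNDS[name])
            for name in FEATURE_BOUNDS]


vector = dense_features(
    static={"kernel_size": 3, "stride": 1, "cache_kib": 1024},
    dynamic={"layer_distortion": 0.2, "relative_accuracy": 0.98},
)
```

Because every feature lands in the same range, quantities with very different units (e.g., kernel size and cache size) contribute on a comparable scale.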
An example compression environment/hardware evaluator ID6_514 is an example structure that evaluates information sent from the compression environment performer circuitry ID6_116.
An example interaction network combiner ID6_522 is an example structure that trains a neural network to capture the non-linear relationship between (a) the example operation latent representor ID6_516, (b) the example HW latent representator ID6_518, and (c) the example normalized dense feature representor ID6_520.
An example policy network outputer ID6_524 is an example structure that outputs the interaction network for policy networks and invokes the actor-critic reinforcement learning network. The policy network outputer ID6_524 learns a probability distribution corresponding to the state for which it has to predict. In some examples, the operation of the policy network outputer ID6_524 can be referred to as a forward pass or as an inference. In some examples, the policy network outputer ID6_524 uses the skilled agent to look at a certain piece of data and, as a result of learning the probability distribution, the policy network decides what kind of compression or pruning needs to take place. In some examples, the output corresponds to a compression decision. As shown in FIG. ID6_1C, at each t=0 . . . t=n in the target network ID6_138, the policy network outputer ID6_524 in combination with the structures in FIG. ID6_5 is executed, and a decision is obtained. Completion at t=n results in a complete compression policy ID6_147. To implement this policy, the quantization performer ID6_148 is used. The quantization performer ID6_148 is an example of the compression environment performer circuitry ID6_116 in FIG. ID6_1A. The example quantization performer ID6_148 takes the compression policy and performs the compression (e.g., reduces the weights) on the target network, thus outputting the policy network ID6_524 using the policy network outputer ID6_524.
An example value network outputer ID6_526 is an example structure that outputs the interaction network for value networks and exhibits the actor-critic reinforcement learning network. In some examples, the value network is a neural network that maps a current state (e.g., output of the interaction network combiner ID6_522) and/or next action (output of the policy network outputer ID6_524) to a goodness metric.
The value network outputer ID6_526 predicts the value of the skilled agent when it is in a given state. In some examples, the value network outputer ID6_526 predicts the value of the action the skilled agent can take. Predictions can be made to decide what the state of the current prediction of the policy is. Predictions can also be made to decide if it is advantageous to be in a certain state. The value network ID6_526 can be reused because there are no dimensionality changes in the input and output of this network, and thus the correlation of value of similar target networks under compression is carried over and retained to new tasks and/or projects.
FIG. ID6_6 is an example of transfer learning techniques disclosed herein that reuse data corresponding to customers with different requirements to create a scalable model compression method for optimal platform specialization. As used herein, transfer learning of machine learning means a model developed for a task is reused as the starting point for a model on a second or subsequent task. Advantages to reusing the data and/or networks include allowing reuse with no dimensionality changes in input and output of this network. Hence, it does not have to be learned again for subsequent tasks and/or clients. Furthermore the correlation of value of similar target networks under compression is carried over and/or retained to new tasks and/or new clients.
FIG. ID6_6 is the same diagram as FIG. ID6_2, however, it demonstrates through the use of reuse ID6_610 and reuse ID6_612 that the skilled agent can be reused for future customers with different workloads, different hardware targets, and/or different goals. Generally, even with customers with widely different tasks there is still some value in each model and skilled agents that can be re-used. Accordingly, examples disclosed herein improve computational efficiency (e.g., time saving) in agent exploration activities.
An example reuse ID6_610 is an example illustration that shows how Agent B ID6_604 is able to reuse at least some data from agent A ID6_602. The meaning of reuse is (1) the embedding and networks (interaction, policy, value) are loaded with the pretrained version in previous tasks (2) and their respective parameters will be adjusted during the iterative cycles for the specific task at hand. This “adjustment” process can also be known as fine-tuning.
An example reuse ID6_612 is an example illustration that shows how agent C ID6_606 is able to reuse at least some data from agent B ID6_604, which already has data from agent A ID6_602. The meaning of reuse is (1) the embedding and networks (interaction, policy, value) are loaded with the pretrained version in previous tasks (2) and their respective parameters will be adjusted during the iterative cycles for the specific task at hand. This adjustment process can also be known as fine-tuning.
Agent A ID6_602 is an example structure that is taken through first reinforcement learning iterative cycles and converges (the model corresponding to VPU converges) on customer A ID6_302 objective(s). During subsequent iterations (e.g., 2nd, 3rd, 4th, etc.), the learned agent (e.g., agent A ID6_602 after the first iteration) is re-used as a starting point for any following tasks such as customer B ID6_304 and/or customer C ID6_306.
FIG. ID6_7 is an example of a framework to accumulate knowledge through a central database. Transfer learning in deep reinforcement learning can be limited by an overfit agent. During each instance of policy search, the model under compression is limited to a set of network operators. For example, recommendation models are likely to use fully connected layers and embeddings, but rarely use convolution. Hardware attributes and dense features are also localized to the intended platform. The lack of variety in experiences (per network operator, per hardware, per objectives) in the local experience replay buffer circuitry ID6_112 leads to an agent memorizing recent rewarding actions and forgetting prior learning. This happens because the agent update process only samples from the local replay buffer, which does not contain memories of previous policy searches. This is analogous to the prevalent issue of imbalanced categories in classification problems where the model biases to a category with more samples. This issue precludes transferability, where a pretrained agent is likely to function on similar target models and similar hardware objectives.
New operators in neural network architectures and new platforms are inevitable. The one/multi-hot encoded (as described in FIG. ID6_5) embedding design allows extension to the new operators and new platforms. In essence, a new operator or a new platform is an addition of a dimension and an entry (e.g., new/alternate vectors) in the embedding. Dimensions for existing operators can be annotated accordingly. Overall potential issues include (1) overfitting to recent tasks and goals and memorization of compression actions and (2) generalization to new network operators and hardware platforms. These potential issues are solved through knowledge accumulation via a central experience database ID6_712 as shown in FIG. ID6_7.
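As a non-limiting illustration of how a one-hot embedding can be extended with a new operator or platform, the following minimal sketch adds a dimension for each new entry while leaving the indices of existing entries unchanged. The class name OneHotVocabulary and the operator names are hypothetical and are chosen only for illustration; they do not appear in the figures.

```python
from typing import Dict, List


class OneHotVocabulary:
    """Hypothetical helper: maps operator (or platform) names to one-hot vectors."""

    def __init__(self, names: List[str]) -> None:
        self.index: Dict[str, int] = {}
        for name in names:
            self.add(name)

    def add(self, name: str) -> int:
        # A new entry receives the next free dimension; existing indices are untouched.
        if name not in self.index:
            self.index[name] = len(self.index)
        return self.index[name]

    def encode(self, name: str) -> List[int]:
        # The vector length always equals the current number of known entries.
        vec = [0] * len(self.index)
        vec[self.index[name]] = 1
        return vec


ops = OneHotVocabulary(["conv2d", "matmul", "embedding"])
before = ops.encode("matmul")   # three dimensions
ops.add("gather_nd")            # new operator: one additional dimension
after = ops.encode("matmul")    # four dimensions, same hot index as before
```

In this sketch, vectors encoded before the extension can be brought to the new size by appending a zero, which is one possible way to annotate dimensions for existing operators.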
In the illustrated example of FIG. ID6_7, an example agent ID6_704 is an example structure that has the same function as the agent ID6_119, which takes actions to learn and optimize a model compression method for optimal platform specialization. In some examples, the example agent ID6_704 is scalable when implemented in the generalized architecture of FIG. ID6_5. An example layer-wise mixed precision/sparsity policy predictor ID6_702 is an example structure that has the same function as the layer-wise mixed precision/sparsity policy predictor circuitry ID6_113 of FIG. ID6_1A, which takes the prediction from the agent ID6_110 and checks for changes in delta (e.g., is the change increasing or decreasing) to decide, for example, when to stop iterations because performance is no longer improving.
An example compression environment performer ID6_706 is an example structure that has the same function as the compression environment performer circuitry ID6_116 of FIG. ID6_1A, which performs compression and evaluates post compression metrics such as accuracy and distortion to determine if further adjustments should be made. The output is fed back to the agent ID6_704 via the experience replay buffer ID6_710 or the central experience database ID6_712. An example accuracy/network states checker ID6_708 is an example structure that has the same function as the accuracy/network states checker circuitry ID6_114 of FIG. ID6_1A, which checks the accuracy and state of the neural network to determine if further adjustments should be made. The output of the example accuracy/network states checker ID6_708 is an input to the dynamic observer ID6_512. An example experience replay buffer ID6_710 is an example structure that has the same function as the experience replay buffer circuitry ID6_112, which contains data from previous iterations. In the experience replay buffer ID6_710, historical policies, rewards, and feedback from the compression environment and hardware evaluator are saved to substantiate the training of the agent ID6_704.
An example central experience database ID6_712 is an example structure that is a repository of historical experiences across most if not all prior learning. The notion of knowledge accumulation is that an agent trained with either online or offline methods has practiced, over time and incrementally, in an abundance of multi-objective, multi-target experiences. The example central experience database ID6_712 captures the explored compression policy, its corresponding embedding, dense features, observations, reward scores, hardware performance, etc. The key purpose of this database is to provide diversity of experiences during the agent ID6_704 training. Applying the central experience database ID6_712 to compression policies helps prevent overfitting of the agent ID6_704 via methods discussed previously. This can be completed through the following example mechanisms/techniques.
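One possible way to check changes in delta and decide when to stop iterating is sketched below. The function name should_stop, the patience window, and the min_delta tolerance are assumptions introduced for illustration; the disclosure does not specify a particular stopping rule.

```python
from typing import Sequence


def should_stop(rewards: Sequence[float], patience: int = 3,
                min_delta: float = 1e-3) -> bool:
    """Return True when none of the last `patience` deltas improves by min_delta.

    `rewards` is a hypothetical per-iteration score history (higher is better).
    """
    if len(rewards) <= patience:
        return False  # not enough history to judge a trend
    recent = rewards[-(patience + 1):]
    # Delta between each pair of consecutive iterations in the window.
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(delta < min_delta for delta in deltas)
```

For example, a history that keeps rising would not trigger a stop, while a history that has been flat for the last three iterations would.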
Online experience augmentation is an example mechanism/technique that provides diversity of experience during the agent ID6_704 training. During each agent ID6_704 update (forward and backward pass), the agent is provided a mixture of experiences from the local replay buffer and the central experience database by sampling proportionally to the coverage of network operators, hardware features, and rewards. The increase in diversity and data size in general results in a robust network.
Offline agent fine-tuning is an example mechanism/technique that provides diversity of experience during the agent ID6_704 training. If the pretrained agent performs subpar, the agent can optionally be fine-tuned, in the manner of supervised machine learning, using the central experience database ID6_712. An example hardware performance evaluator ID6_714 is an example structure that has the same function as the hardware performance evaluator circuitry ID6_124, which evaluates the performance of the agent ID6_110 predicted policy. An example hardware results sender ID6_716 is an example structure that has the same function as the hardware results sender circuitry ID6_120, which receives evaluations from the hardware performance evaluator circuitry ID6_124 and sends hardware metrics such as latency, throughput, and power to the agent ID6_110.
From the foregoing, it will be appreciated that example methods, systems, apparatus and articles of manufacture have been disclosed that improve the efficiency of using a computing device by automating convergence and with a shorter turnaround time to search the compression configuration optimized for a given neural network, target accuracy and performance. Additionally, examples disclosed herein allow users to scale faster and to dynamically convert their custom network topologies or variants to specialize across hardware platforms. Furthermore, there is a distinct advantage for projects that are constrained to running one job at any given time, as a shorter run-time would result in a more efficient and faster result. The disclosed methods, apparatus and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.
In some examples, the input receiver circuitry ID6_102 includes means for hardware target receiving, means for training dataset receiving and means for network definition receiving. Examples also include means for policy predicting, means for experience replay buffering, means for compression performing, means for state checking, means for network outputting, means for hardware performance evaluating and means for results sending. For example, the means for hardware target receiving may be implemented by the automation HW target receiver circuitry ID6_104, the means for training dataset receiving may be implemented by the training dataset receiver circuitry ID6_106, the means for network definition receiving may be implemented by trained network definition receiver circuitry ID6_108, the means for policy predicting may be implemented by the layer-wise mixed-precision/sparsity policy predictor circuitry ID6_113, the means for experience replay buffering may be implemented by the experience replay buffer circuitry ID6_112, the means for compression performing may be implemented by the compression environment performer circuitry ID6_116, the means for state checking may be implemented by the accuracy/network states checker circuitry ID6_114, the means for network outputting may be implemented by the network outputter circuitry ID6_122, the means for hardware performance evaluating may be implemented by the hardware performance evaluator circuitry ID6_124, and the means for results sending may be implemented by the hardware results sender circuitry ID6_120. In some examples, the aforementioned structure may be implemented by machine executable instructions disclosed herein and executed by processor circuitry, which may be implemented by the example processor circuitry D212 of
Further variations of the above-identified disclosed examples are provided by the following examples.
Example methods, apparatus, systems, and articles of manufacture to optimize resources in Edge networks are disclosed herein. Further examples and combinations thereof include the following: Example 95 includes an apparatus comprising at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus, a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations, or Application Specific Integrated Circuitry including logic gate circuitry to perform one or more third operations, the processor circuitry to at least one of perform at least one of the first operations, the second operations or the third operations to calculate a first compression policy for a first model to execute on first hardware, the first compression policy including first compression predictions corresponding to respective layers of the first model, compare performance metrics of the first compression policy with first model parameters associated with the first model, release a first compressed model corresponding to the first compression policy when the performance metrics satisfy a performance threshold, and in response to retrieving a second model to execute on second hardware, calculate a second compression policy based on the first compression predictions corresponding to the respective layers of the first model.
Example 96 includes the apparatus as defined in example 95, wherein the processor circuitry is to apply at least one of a deep reinforcement learning algorithm, a Bayesian optimization algorithm, or an evolutionary algorithm.
Example 97 includes the apparatus as defined in example 95, wherein the processor circuitry is to iterate outputs of the first compression predictions with a reward function, the reward function based on the first model performance parameters.
Example 98 includes the apparatus as defined in example 95, wherein the first compression policy is based on at least one of pruning layers of the first model or quantizing the first model.
Example 99 includes the apparatus as defined in example 95, wherein the processor circuitry is to reduce a model compression optimization duration by executing an agent having model parameters corresponding to the first compressed model during an initial modeling iteration.
Example 100 includes the apparatus as defined in example 99, wherein the processor circuitry is to optimize based on at least one of a graphics processing unit target, a central processing unit target or a field programmable gate array target.
Example 101 includes at least one non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least calculate a first compression policy for a first model to execute on first hardware, the first compression policy including first compression predictions corresponding to respective layers of the first model, compare performance metrics of the first compression policy with first model parameters associated with the first model, release a first compressed model corresponding to the first compression policy when the performance metrics satisfy a performance threshold, and in response to retrieving a second model to execute on second hardware, calculate a second compression policy based on the first compression predictions corresponding to the respective layers of the first model.
Example 102 includes the at least one computer readable storage medium as defined in example 101, wherein the instructions, when executed, cause the at least one processor to apply at least one of a deep reinforcement learning algorithm, a Bayesian optimization algorithm, or an evolutionary algorithm.
Example 103 includes the at least one computer readable storage medium as defined in example 101, wherein the instructions, when executed, cause the at least one processor to iterate outputs of the first compression predictions with a reward function, the reward function based on the first model performance parameters.
Example 104 includes the at least one computer readable storage medium as defined in example 101, wherein the first compression policy is based on at least one of pruning layers of the first model or quantizing the first model.
Example 105 includes the at least one computer readable storage medium as defined in example 101, wherein the instructions, when executed, cause the at least one processor to reduce a model compression optimization duration by executing an agent having model parameters corresponding to the first compressed model during an initial modeling iteration.
Example 106 includes the at least one computer readable storage medium as defined in example 105, wherein the instructions, when executed, cause the at least one processor to optimize based on at least one of a graphics processing unit target, a central processing unit target or a field programmable gate array target.
Example 107 includes a method comprising calculating a first compression policy for a first model to execute on first hardware, the first compression policy including first compression predictions corresponding to respective layers of the first model, comparing performance metrics of the first compression policy with first model parameters associated with the first model, releasing a first compressed model corresponding to the first compression policy when the performance metrics satisfy a performance threshold, and in response to retrieving a second model to execute on second hardware, calculating a second compression policy based on the first compression predictions corresponding to the respective layers of the first model.
Example 108 includes the method as defined in example 107, wherein calculating the first compression policy includes applying at least one of a deep reinforcement learning algorithm, a Bayesian optimization algorithm, or an evolutionary algorithm.
Example 109 includes the method as defined in example 107, wherein calculating the first compression policy includes iterating outputs of the first compression predictions with a reward function, the reward function based on the first model performance parameters.
Example 110 includes the method as defined in example 107, wherein the first compression policy is based on at least one of pruning layers of the first model or quantizing the first model.
Example 111 includes the method as defined in example 107, wherein calculating the second compression policy includes reducing a model compression optimization duration by executing an agent having model parameters corresponding to the first compressed model during an initial modeling iteration.
Example 112 includes the method as defined in example 111, wherein the agent optimizes based on at least one of a graphics processing unit target, a central processing unit target or a field programmable gate array target.
Example 113 includes an apparatus comprising an agent to calculate a first compression policy for a first model to execute on first hardware, the first compression policy including first compression predictions corresponding to respective layers of the first model, compression environment performer circuitry to compare performance metrics of the first compression policy with first model parameters associated with the first model, and accuracy checker circuitry to release a first compressed model corresponding to the first compression policy when the performance metrics satisfy a performance threshold, the agent to in response to retrieving a second model to execute on second hardware, calculate a second compression policy based on the first compression predictions corresponding to the respective layers of the first model.
Example 114 includes the apparatus as defined in example 113, wherein the agent is to apply at least one of a deep reinforcement learning algorithm, a Bayesian optimization algorithm, or an evolutionary algorithm.
Example 115 includes the apparatus as defined in example 113, further including experience replay buffer circuitry to iterate outputs of the first compression predictions with a reward function, the reward function based on the first model performance parameters.
Example 116 includes the apparatus as defined in example 113, wherein the first compression policy is based on at least one of pruning layers of the first model or quantizing the first model.
Example 117 includes the apparatus as defined in example 113, wherein the agent is to reduce a model compression optimization duration by executing a previous version of the agent having model parameters corresponding to the first compressed model during an initial modeling iteration.
Example 118 includes the apparatus as defined in example 117, wherein the agent is to optimize based on at least one of a graphics processing unit target, a central processing unit target or a field programmable gate array target.
Quantization of deep learning models requires deciding what operations to quantize and how to quantize them. Quantization for deep learning is the process of approximating a neural network that is initially structured to use a first bit width (e.g., floating-point numbers) with an alternate representation that consumes a relatively lower bit width. This reduces both the memory requirement and computational cost of using neural networks. Additionally, this also improves power requirements, particularly in view of Edge devices and their various limitations.
Generally, particular operations to be quantized are decided and input by a user. However, this is time intensive for the user, and because user selection is driven by discretionary behaviors (e.g., “gut feel”), such selections lack an optimum efficiency. Some operations within a network are particularly suited for quantization efforts to yield varying degrees of success. Factors that affect a decision to quantize the particular operations include, but are not limited to, an initial bit width, a type of operation, a type of instruction associated with the operation (e.g., a MatMul operation, a GatherNd operation), and/or an adjacency of the instructions proximate to other instructions. Examples disclosed herein apply reinforcement learning to decide whether to quantize operations in a neural network model, thereby eliminating erroneous user discretion and reducing model development time.
Modern deep learning neural network models have many quantizable operations, which makes manual decision making of whether to quantize or not for each operation inefficient due to a large problem space. To identify particular models to quantize, to identify which operations to quantize, and/or to select alternate bit widths, human efforts can take, for example, approximately 4 weeks whereas examples disclosed herein take, for example, approximately 10 hours (or less). Human efforts take more time because human developers need to evaluate operations for parameters and make decisions on whether each operation is quantizable or not. However, creating a framework to automate this process makes quantization of deep learning neural network models more efficient. Additionally, examples disclosed herein reduce and/or otherwise eliminate errors due to human discretionary choices. Furthermore, modern deep learning neural network models can be more difficult to solve with varying hardware resources (CPUs, GPUs, accelerators, etc.) where performance of quantizable operations differs.
In example methods disclosed herein, quantization is achieved using grouping where adjacent or similar operations with similar quantization operations can be grouped into a large block and can be quantized collectively.
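The grouping of adjacent operations can be sketched as follows. This minimal example assumes each operation is represented by a hypothetical (name, bit width) pair and merges neighboring operations that share a bit width into one block; grouping by similarity of operation type could be done by changing the key.

```python
from itertools import groupby
from typing import List, Tuple

Op = Tuple[str, int]  # hypothetical representation: (operation name, chosen bit width)


def group_adjacent(ops: List[Op]) -> List[List[Op]]:
    """Merge runs of neighboring operations that share a bit width into blocks."""
    # groupby only merges consecutive items, which matches the adjacency requirement.
    return [list(run) for _, run in groupby(ops, key=lambda op: op[1])]


ops = [("matmul_0", 8), ("relu_0", 8), ("gather_nd_0", 32),
       ("matmul_1", 8), ("matmul_2", 8)]
blocks = group_adjacent(ops)
```

Here the five operations collapse into three blocks, so a search procedure would make three quantization decisions instead of five.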
Quantization is also used as neural networks move from servers (e.g., having relatively capable processing and/or power resources) to the Edge environment (e.g., having relatively less capable processing and/or power resources) because it is necessary to optimize speed and size due to hardware limitations (e.g., CPU vs. GPU). Quantization replaces floating-point values with integers inside the neural network. Replacing floating-point values (e.g., weights, optimization weights) with integers results in less memory consumption and faster calculations/operations.
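As an illustration of replacing floating-point values with integers, the following sketch applies symmetric 8-bit quantization with a single scale per tensor. This particular scheme is an assumption chosen for brevity; the disclosure does not limit quantization to this scheme, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one scale for the whole list."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [1.0, -1.0, 0.0, 0.3]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step, which is the accuracy cost traded for storing one byte per weight instead of four.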
FIG. ID7_1 is an example schematic illustration of a framework ID7_100 for generating and providing optimal quantization weights to deep learning models. Examples disclosed herein can be located at and/or otherwise perform optimization tasks at any location of the example Edge cloud, such as the example Edge cloud A110 of
In the illustrated example of FIG. ID7_1, the framework ID7_100 includes quantization controller circuitry ID7_102, quantized topology generator circuitry ID7_104, environment quantizer circuitry ID7_106, reward assigner circuitry ID7_108, and search space solver circuitry ID7_110. In operation and as discussed below, the example framework ID7_100 generates and/or otherwise provides improved quantization weights for deep learning models.
Some examples include means for quantization controlling, means for quantized topology generating, means for environment quantizing, means for reward assigning, and means for search space solving. For example, the means for quantization controlling may be implemented by the quantization controller circuitry ID7_102, the means for quantized topology generating may be implemented by the quantized topology generator circuitry ID7_104, the means for environment quantizing may be implemented by the environment quantizer circuitry ID7_106, the means for reward assigning may be implemented by the reward assigner circuitry ID7_108, and the means for search space solving may be implemented by the search space solver circuitry ID7_110. In some examples, the aforementioned structure may be implemented by machine executable instructions disclosed herein and executed by processor circuitry, which may be implemented by the example processor circuitry D212 of
The example quantization controller circuitry ID7_102 is an example of a structure to decide what types of operations should be quantized and how to approach quantization decisions. Deep learning models (e.g., neural networks) are resource intensive algorithms that cause processing resources to incur significant computational costs and memory, which is why quantization is necessary. The quantization controller circuitry ID7_102 optimizes the training and inference of deep learning models to reduce (e.g., minimize) costs (e.g., memory costs, CPU costs, bandwidth consumption costs, accuracy tradeoff costs, storage space costs, etc.). When running on the example Edge cloud A110 of
The example quantization controller circuitry ID7_102 generates potential MatMul operation candidates. While examples disclosed herein refer to MatMul operations, such examples are discussed for convenience and not limitation. Example operations disclosed herein are not limited to MatMul operations, and may include convolutions, Relu, concat, Conv2D, Conv3D, transpose, GatherNd, etc. The MatMul operation is a common implementation of matrix operations in deep learning. The MatMul operation returns the matrix product of two arrays. The MatMul operation returns a normal product for 2-D arrays; however, if either argument has more than two dimensions, that argument is treated as a stack of matrices residing in the last two indexes. Furthermore, arrays with different shapes (e.g., two or more arrays of different sizes) can use broadcasting. Broadcasting provides a way of vectorizing array operations so that explicit looping is avoided. Once the example quantization controller circuitry ID7_102 decides what type of operation should be quantized, the quantization controller circuitry ID7_102 can use solutions from MatMul. The quantization controller circuitry ID7_102 initiates training to converge to an optimal solution. The optimal solution can be a value from a user, a user input, a particular convergence threshold value, and/or a predetermined value in storage. Using solutions from MatMul operations, the quantization controller circuitry ID7_102 also determines which hardware would be optimal for the deep learning model and/or which optimal solution (e.g., optimal weight to integer conversion, optimal input to integer conversion, optimal weight) can be attained on a designated hardware model. In some examples, a CPU in a desktop machine executes float arithmetic as fast as integer arithmetic, therefore either float or integer values could be optimal.
In some examples, a GPU in a desktop machine is optimized for single precision float calculations, therefore single precision float values could be optimal on that hardware. In examples herein, the quantization controller circuitry ID7_102 calculates tradeoffs between accuracy and speed. When the quantization controller circuitry ID7_102 determines to use an approximation closer to the floating point, the result will be a decrease in performance (e.g., speed, power consumption, throughput), but this can result in increased accuracy. On the other hand, when the quantization controller circuitry ID7_102 determines to use an integer value, the result will be an increase in performance (e.g., speed, power consumption, throughput), but this can result in decreased accuracy. In view of target performance metrics (e.g., target FLOPS, target/desired cache storage, target/desired frequency, etc.), the example quantization controller circuitry ID7_102 calculates decisions for how to quantize and how to approach quantization. In some examples, a reward function (or cost function) is applied to identify metrics in view of a score.
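The MatMul semantics described above (a normal product for 2-D inputs, and a stack of matrices for higher-dimensional inputs) can be illustrated with the following plain-Python sketch. The function names matmul_2d and matmul_stacked are hypothetical, and the sketch covers only the case of a 3-D operand multiplied by a shared 2-D operand; array libraries typically generalize this to arbitrary leading dimensions.

```python
from typing import List

Matrix = List[List[float]]


def matmul_2d(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary matrix product of an (n x k) matrix and a (k x m) matrix."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    # zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_stacked(stack: List[Matrix], b: Matrix) -> List[Matrix]:
    """Treat a 3-D input as a stack of matrices and reuse the 2-D operand for each."""
    return [matmul_2d(a, b) for a in stack]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul_2d(a, b)
stacked = matmul_stacked([a, [[1, 0], [0, 1]]], b)
```

Reusing the same 2-D operand for every matrix in the stack is the simplest form of broadcasting: the smaller operand is conceptually repeated along the leading dimension without being copied.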
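One hedged way to express the accuracy/speed tradeoff as a single score is sketched below. The weighted-sum form, the normalization against a baseline, and all parameter names are assumptions made for illustration; the disclosure does not prescribe a particular reward function.

```python
def reward(accuracy: float, latency_s: float,
           baseline_accuracy: float, baseline_latency_s: float,
           accuracy_weight: float = 1.0, speed_weight: float = 1.0) -> float:
    """Score a quantized candidate relative to an unquantized baseline.

    Higher is better: the score rises with retained accuracy and with speedup.
    """
    accuracy_term = accuracy / baseline_accuracy   # 1.0 means no accuracy loss
    speedup_term = baseline_latency_s / latency_s  # 2.0 means twice as fast
    return accuracy_weight * accuracy_term + speed_weight * speedup_term


baseline_score = reward(0.9, 1.0, baseline_accuracy=0.9, baseline_latency_s=1.0)
int8_score = reward(0.88, 0.5, baseline_accuracy=0.9, baseline_latency_s=1.0)
```

With equal weights, the hypothetical integer candidate scores higher than the baseline because a 2x speedup outweighs a small accuracy drop; raising accuracy_weight shifts the preference back toward representations closer to floating point.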
In summary, the quantization controller circuitry ID7_102, during an initial iteration, generates baseline metrics corresponding to a model, identifies operations corresponding to the model, generates a search space corresponding to the model, the search space including respective ones of the operations that can be quantized, and generates a first quantization topology, the first quantization topology corresponding to a first search strategy.
The quantization controller circuitry ID7_102 includes the example quantized topology creator circuitry ID7_104, which is an example of structure to create quantized topologies. Using operations (e.g., MatMul operations), the quantized topology creator circuitry ID7_104 utilizes the decision output by the quantization controller circuitry ID7_102 to, in a first iteration, generate a guess at an optimal solution (e.g., optimal weight to integer conversion, optimal input to integer conversion, optimal weight) and generate a candidate topology (e.g., interconnected nodes and layers). In the first iteration, the example quantized topology creator circuitry ID7_104 generates a guess at an optimal hardware configuration and generates a candidate topology. During subsequent iterations, the example quantized topology creator circuitry ID7_104 calculates decisions based on observations/rewards and generates additional candidate optimal topologies until the quantization controller circuitry ID7_102 converges on an optimal solution. An example optimal solution is reached when the observations/rewards reach a threshold value (e.g., greater than 80% accuracy, less than 30 seconds run time). However, any number and/or type of optimal solution may be defined with alternate metrics (e.g., power consumption metrics, storage size metrics, accuracy metrics, etc.). The quantized topology creator circuitry ID7_104 decides and/or otherwise selects factors that include, but are not limited to, an initial bit width, a type of operation, a type of instruction associated with the operation, and/or an adjacency of instructions proximate to other instructions. The example environment quantizer circuitry ID7_106 is structure to carry out quantization, benchmarking, and testing of the neural network. The example environment quantizer circuitry ID7_106 measures any number of factors. 
Examples of such factors include, but are not limited to, accuracy, speed, size, and latency of the quantization. The environment quantizer circuitry ID7_106 carries out the action of quantization by deciding how to go from floating points to integers. The environment quantizer circuitry ID7_106 determines this through any number of factors (e.g., increase/decrease in accuracy, speed of quantization, size of topology, and/or user requirements, etc.). The environment quantizer circuitry ID7_106 carries out the action of benchmarking by comparing the performance of the neural network to other architectures (e.g., neural networks constructed with various combinations of input layers (e.g., initial data for the neural network), hidden layers (e.g., intermediate layers between the input and output layers, where the computation is done), and output layers (e.g., layers that produce the result for given inputs)) using available benchmark data sets. Benchmarking can be accomplished using labeled data sets or through generated data sets. Labeled data sets are required to have large amounts of labeled data. Generated data sets have automatically labeled data and can show how a neural network excels at identifying slight errors. Generated data sets give a metric of the model's sensitivity and complexity. The example environment quantizer circuitry ID7_106 tests sensitivity in a generated data set by identifying errors and perturbations (e.g., “wildly incorrect” vs “incorrect” vs “mildly incorrect”). More exact labels result in a neural network with higher sensitivity. The example environment quantizer circuitry ID7_106 tests complexity in a generated data set by increasing the number of objects (e.g., datasets, datapoints). The example environment quantizer circuitry ID7_106 tests the neural network for handling greater complexity. Using a labeled and/or generated dataset, the environment quantizer circuitry ID7_106 executes benchmarking of a model (e.g., a NN of interest). 
The environment quantizer circuitry ID7_106 carries out the action of testing the neural network. In some examples, testing the neural network involves using a training dataset to determine if the neural network outputs the known optimal solution. Testing the neural network includes tuning the model's hyperparameters (e.g., the number of hidden units—layers and layer widths—in a neural network). In some examples, testing the neural network is used for regularization by early stopping (e.g., stopping training when the error on the dataset increases, which indicates overfitting to the training dataset). Thus, the example environment quantizer circuitry ID7_106 carries out the action of quantization, benchmarking, and testing the network.
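Regularization by early stopping, as mentioned above, can be sketched as follows. The patience parameter and the function name are assumptions introduced for illustration; the disclosure only states that training stops when the error on the dataset increases.

```python
# Minimal sketch of early stopping over a recorded error curve.
# `patience` (how many non-improving epochs to tolerate) is an assumption.

def early_stopping_epoch(errors, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1                 # error did not improve this epoch
            if waited >= patience:      # rising error suggests overfitting
                break
    return best_epoch

errors_per_epoch = [0.50, 0.40, 0.35, 0.36, 0.38, 0.30]
kept_epoch = early_stopping_epoch(errors_per_epoch)
```

With the hypothetical error curve shown, training stops after two consecutive non-improving epochs, so the later improvement at the final epoch is never reached; a larger patience value trades extra training time for the chance to find it.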
The example environment quantizer circuitry ID7_106 includes the example reward assigner circuitry ID7_108, which is example structure for making observations and assigning rewards to actions (e.g., different permutations of quantization precisions, different combination groupings of operations, different permutations to “try,” etc.). The reward assigner circuitry ID7_108 takes the output of the environment quantizer circuitry ID7_106 (e.g., accuracy, speed, size, and latency of the act of quantization) and determines the value of actions associated with that output (e.g., higher accuracy assigns a higher reward value, lower speed assigns a higher reward value, etc.). On its own, the environment quantizer circuitry ID7_106 outputs results without any feedback of positive or negative reinforcement. Assigning rewards in neural networks results in keeping track of rewards and the resulting state after taking actions (e.g., different permutations of quantization precisions, different combination groupings of operations). Thus, actions with higher rewards had a positive outcome, and similar actions should be repeated to attain higher results. Assigning rewards is accomplished by adding a reward value to an action or metric that is predefined. The reward value can be predefined by a user, a user input, a predetermined value in storage, and/or other ways of storing or inputting reward values. In some examples, an action or metric that the user assigns is the accuracy and latency of a system. In some examples, an action or metric that the user assigns is performance and speed. Thus, the reward assigner circuitry ID7_108 makes observations and assigns rewards to actions.
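One way the reward assignment described above might be expressed is as a weighted score relative to baseline metrics. The weights, metric names, and sample observations below are hypothetical values chosen for illustration and do not come from the disclosure.

```python
# Minimal sketch of a reward function; weights and metrics are assumptions.

def assign_reward(metrics, baseline, weights=None):
    """Score one quantization action relative to the baseline metrics."""
    weights = weights or {"accuracy": 1.0, "latency": 0.5, "size": 0.25}
    accuracy_gain = metrics["accuracy"] - baseline["accuracy"]
    latency_gain = (baseline["latency_ms"] - metrics["latency_ms"]) / baseline["latency_ms"]
    size_gain = (baseline["size_mb"] - metrics["size_mb"]) / baseline["size_mb"]
    return (weights["accuracy"] * accuracy_gain
            + weights["latency"] * latency_gain
            + weights["size"] * size_gain)

baseline = {"accuracy": 0.90, "latency_ms": 40.0, "size_mb": 100.0}
observations = {   # hypothetical outputs of the environment for two actions
    "int8": {"accuracy": 0.88, "latency_ms": 20.0, "size_mb": 25.0},
    "bf16": {"accuracy": 0.90, "latency_ms": 30.0, "size_mb": 50.0},
}
rewards = {action: assign_reward(m, baseline) for action, m in observations.items()}
best_action = max(rewards, key=rewards.get)   # action worth repeating or refining
```

Changing the weights shifts which action is favored, which corresponds to a user assigning, for example, accuracy and latency as the metrics of interest.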
In summary, the example environment quantizer circuitry ID7_106 performs quantization on the first quantization topology and compares quantization results of the first quantization topology to the baseline metrics, the quantization controller circuitry ID7_102 to update the first search strategy to a second search strategy during a second iteration, the second search strategy corresponding to an updated version of the model, the updated version of the model having improved metrics compared to the baseline metrics.
The example environment quantizer circuitry ID7_106 includes the search space and strategy solver circuitry ID7_110, which is an example structure for defining the search space and calculating a search strategy for quantization. The search space is defined as families of models specialized to solve deep learning problems. In some examples, the search strategies are automated, and some example search strategies include, but are not limited to, reinforcement learning (e.g., policy gradient), evolutionary algorithms (e.g., genetic algorithms), and heuristic search (e.g., branch & bound). In reinforcement learning, the neural network takes actions to maximize reward in a particular situation. The search space and strategy solver circuitry ID7_110 finds the best possible behavior or path that should be taken in a specific situation. In evolutionary algorithms, the neural network implements bio-inspired operators such as mutation, crossover, and selection to generate a solution to optimization and search problems. In heuristic search, the neural network uses a search strategy that attempts to optimize a problem by iteratively improving the solution based on a given heuristic function or a cost measure. Thus, the search space and strategy solver circuitry ID7_110 uses one or more of the foregoing, or any other search space and strategy solver, to define the search space and calculate and/or otherwise determine one or more search strategies for quantization.
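As one illustration of the evolutionary strategy named above, the following minimal sketch applies selection, crossover, and mutation to per-operation bit widths. The fitness function is a toy stand-in for measured rewards, and every name and constant in it is an assumption rather than part of the disclosure.

```python
# Minimal sketch of an evolutionary search over per-operation bit widths.
import random

BIT_WIDTHS = (8, 16, 32)

def fitness(topology):
    """Toy reward: fewer bits is faster, but the first operation is
    assumed to lose too much accuracy below 16 bits."""
    speed = sum(32 - bits for bits in topology)
    penalty = 40 if topology[0] < 16 else 0
    return speed - penalty

def evolve(num_ops=4, population=8, generations=20, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(BIT_WIDTHS) for _ in range(num_ops)] for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]             # selection (best half survives)
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, num_ops)
            child = a[:cut] + b[cut:]                # crossover
            child[rng.randrange(num_ops)] = rng.choice(BIT_WIDTHS)   # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Because the best half of each generation is carried forward unchanged, the best score in this sketch never decreases from one generation to the next. A reinforcement learning or branch and bound strategy could be substituted behind the same fitness interface.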
The combination of the example quantization controller circuitry ID7_102, the example quantized topology creator circuitry ID7_104, the example environment quantizer circuitry ID7_106, the example reward assigner circuitry ID7_108, and the example search space and strategy solver circuitry ID7_110 in FIG. ID7_1 is an example structure diagram for the schematic illustration of a framework (e.g., a system) for generating and providing optimal quantization weights to deep learning models. FIG. ID7_2 illustrates example methods and structure that form a framework ID7_200 for the example quantization controller circuitry ID7_102 and the example environment quantizer circuitry ID7_106, similar to the example in FIG. ID7_1. The example framework ID7_200, as described in further detail below, achieves generation and output of one or more optimized models ID7_220 for inferencing that is based on dynamic input data and/or conditions.
The example quantization controller circuitry ID7_102 from FIG. ID7_1 initiates a method for deciding what types of operations should be quantized and how to approach quantization. This requires multiple steps performed by the example quantized topology creator circuitry ID7_104 and, in some examples, includes inputting a labeled training set ID7_204, determining a search space of model (A) ID7_206, and invoking a search strategy to generate an augmented model to try (A′) ID7_208, as described in further detail below.
Example labeled training set(s) are received, retrieved and/or otherwise obtained from a user, or from a data source (e.g., historical data source) (ID7_204). The example labeled training set(s) are samples that have been tagged with one or more labels.
The quantization controller circuitry ID7_102 also initiates and determines the search space of model (A) ID7_206, such as a model of interest that is to be optimized. In some examples, the model (A) is to be moved from a centralized server (e.g., operating on hardware with first computational capabilities) to one or more Edge devices (e.g., operating on hardware with second computational capabilities that are less than those of the centralized server). The search space of the model (A) includes families of models and/or operations within such models that are capable of being quantized and/or otherwise specialized to solve deep learning problems. In some examples, this is performed by the example quantized topology creator circuitry ID7_104.
The quantization controller circuitry ID7_102 also initiates and invokes a search strategy to generate an augmented model (A′) ID7_208 through the application of one or more search strategies on the example model (A) ID7_206. As discussed above, example search strategies include reinforcement learning (e.g., policy gradient), evolutionary algorithms (e.g., genetic algorithms), and heuristic search (e.g., branch & bound). This is executed by the example quantized topology creator circuitry ID7_104.
The quantization controller circuitry ID7_102 sends the example augmented model (A′) to an example environment quantizer circuitry ID7_106.
The example environment quantizer circuitry ID7_106 (e.g., devices within an Edge network) executes and/or otherwise performs quantization, benchmarking, and testing of candidate augmented model(s) (A′).
The example reward assigner circuitry ID7_108 evaluates the augmented model (A′) ID7_214 and determines the performance of the quantization of the operation through observations and rewards. Additionally, the reward assigner circuitry ID7_108 facilitates the determination of the performance of the augmented model (A′) on a type of hardware. This iterative process can perform such evaluations on any number of different types of hardware to ascertain performance metrics of quantized augmented models (A′). Some examples of hardware include, but are not limited to, any combination of one or more of CPUs, GPUs, FPGAs, quantum devices and/or accelerators. Results of testing respective ones of the augmented model (A′) (ID7_218) are iteratively fed back to invoke one or more alternate search strategies and generate further different augmented models (A′) to try. In response to an iteration threshold or one or more convergence metrics related to the performance of the augmented model (A′), the example environment quantizer circuitry ID7_106 outputs a model for inferencing based on the input training set ID7_220. The output (ID7_220) is sent by the search space and strategy solver circuitry ID7_110. The output represents an example output model for inferencing based on the input training set ID7_220 after any number of iterations that result in an optimal solution (e.g., convergence).
As discussed above, the example environment quantizer circuitry ID7_106 outputs a performance estimate of each iteration of an augmented model (A′) ID7_218. The performance estimate (e.g., performance data corresponding to latency metrics, accuracy metrics, etc.) of each augmented model (A′) ID7_218 includes sending information corresponding to the performance estimate ID7_214 of the quantization (e.g., observations and rewards) as well as information corresponding to the performance of the quantization of the operation for a particular type of hardware implemented by an example target platform ID7_216.
The example target platform ID7_216 is a structure that includes any type of hardware on which the quantization is executed. Some examples include, but are not limited to, any combination of one or more of CPUs, GPUs, FPGAs and/or accelerators.
The combination of inputting a labeled training set ID7_204, determining the search space of model (A), invoking the search strategy to generate A′, the A′ (e.g., model (A)), evaluating performance ID7_214, outputting a model for inferencing based on the input training set ID7_220, and the performance estimate of A′ is similar to FIG. ID7_1 and uses the structures including the quantization controller circuitry ID7_102, the quantized topology creator circuitry ID7_104, the environment quantizer circuitry ID7_106, the reward assigner circuitry ID7_108, and the search space and strategy solver circuitry ID7_110 to generate and provide optimal quantization weights to deep learning models.
While an example manner of implementing the framework for generating and providing optimal quantization weights to deep learning models of FIG. ID7_1 is illustrated in FIGS. ID7_2, ID7_3, ID7_4, ID7_5, and ID7_6, one or more of the elements, processes and/or devices illustrated in FIG. ID7_1 may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example quantization controller circuitry ID7_102, the example quantized topology creator circuitry ID7_104, the example environment quantizer circuitry ID7_106, the example reward assigner circuitry ID7_108 and/or, more generally, the example search space and strategy solver circuitry ID7_110 of FIG. ID7_1 may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Thus, for example, any of the example quantization controller circuitry ID7_102, the example quantized topology creator circuitry ID7_104, the example environment quantizer circuitry ID7_106, the example reward assigner circuitry ID7_108 and/or, more generally, the example search space and strategy solver circuitry ID7_110 of FIG. ID7_1 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), programmable controller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)) and/or field programmable logic device(s) (FPLD(s)). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example quantization controller circuitry ID7_102, the example quantized topology creator circuitry ID7_104, the example environment quantizer circuitry ID7_106, the example reward assigner circuitry ID7_108 and/or, more generally, the example search space and strategy solver circuitry ID7_110 of FIG. ID7_1 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc. including the software and/or firmware. Further still, the example quantization controller circuitry ID7_102, the example quantized topology creator circuitry ID7_104, the example environment quantizer circuitry ID7_106, the example reward assigner circuitry ID7_108 and/or, more generally, the example search space and strategy solver circuitry ID7_110 of FIG. ID7_1 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIG. ID7_1, and/or may include more than one of any or all of the illustrated elements, processes and devices. As used herein, the phrase “in communication,” including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.
Flowcharts representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the example quantization controller circuitry ID7_102, the example quantized topology creator circuitry ID7_104, the example environment quantizer circuitry ID7_106, the example reward assigner circuitry ID7_108 and/or, more generally, the example search space and strategy solver circuitry ID7_110 of FIG. ID7_1 are shown in FIGS. ID7_2, ID7_3 and ID7_6. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor ID7_712 shown in the example processor platform ID7_700 discussed below in connection with FIGS. ID7_2, ID7_3 and ID7_6. The program(s) may be embodied in software stored on a non-transitory computer readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor ID7_712, but the entire program and/or parts thereof could alternatively be executed by a device other than the processor ID7_712 and/or embodied in firmware or dedicated hardware. Further, although the example program is described with reference to the flowcharts illustrated in FIGS. ID7_2, ID7_3 and/or ID7_6, many other methods of implementing the example quantization controller circuitry ID7_102, the example quantized topology creator circuitry ID7_104, the example environment quantizer circuitry ID7_106, the example reward assigner circuitry ID7_108 and/or, more generally, the example search space and strategy solver circuitry ID7_110 of FIG. ID7_1 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. 
Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware. The processor circuitry may be distributed in different network locations and/or local to one or more devices (e.g., a multi-core processor in a single machine, multiple processors distributed across a server rack, etc.).
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement one or more functions that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example processes of FIGS. ID7_2, ID7_3, ID7_5, ID7_6 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
FIG. ID7_3 is a flowchart representative of machine-readable instructions which may be executed to implement an apparatus and method for generating and providing optimal quantization weights to deep learning models. FIG. ID7_3 is similar to FIG. ID7_1 and FIG. ID7_2 and shows another example representation of generating and providing optimal quantization weights to deep learning models. The illustrated example of FIG. ID7_3 is implemented by the example structure of FIG. ID7_1.
In the illustrated example of FIG. ID7_3, the quantization controller circuitry ID7_102 retrieves a model (block ID7_302). The example quantization controller circuitry ID7_102 obtains a new model that is introduced to the framework (FIG. ID7_1), and the model goes through pre-processing steps in which the quantizable operations are gathered and the optimal weights are quantized. These pre-processing steps are pre-calculated so that they are not continuously repeated. In some examples, quantization occurs post-training, where the deep learning model is trained using weights and inputs and the weights are quantized after training. In some examples, quantization occurs during training, where the gradients are calculated for the quantized weights. The example quantization controller circuitry ID7_102 retrieves and/or otherwise obtains a model (block ID7_302) (e.g., from a user, from a database of available models, etc.). However, in some instances the example quantization controller circuitry ID7_102 retrieves a model (block ID7_302) and obtains one or more models from any data source without user instruction. The example quantization controller circuitry ID7_102 retrieves a model (block ID7_302) and sends the model to establish a benchmark test (block ID7_304) and to parse the model to identify operations (block ID7_308).
The quantization controller circuitry ID7_102 executes the example process to establish a benchmark test (block ID7_304). The benchmark test (block ID7_304) is an example of a process that obtains (e.g., automatically or in response to detecting the presence of one or more additional models) and/or retrieves a model (block ID7_302) and runs benchmark tests. Some example benchmark tests measure, but are not limited to, latency and accuracy. After establishing the benchmark test(s) (block ID7_304), the results of the benchmark test are stored in the example quantization controller circuitry ID7_102. These benchmark tests (block ID7_304) provide a baseline comparison so that, during iterations of generating and providing optimal quantization weights, the result will converge on a solution that produces better test results (e.g., higher accuracy) than the benchmark test (block ID7_304).
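A baseline benchmark of the kind described above could be gathered as in the following minimal sketch, which records accuracy and latency for a model supplied as a callable. The toy model, the labeled samples, and the helper names are hypothetical and serve only to make the sketch executable.

```python
# Minimal sketch of establishing baseline metrics; all names are assumptions.
import time

def run_benchmark(model, samples, repeats=3):
    """Return accuracy and best-of-N latency for `model`, a callable that
    maps one input to a predicted label."""
    correct = sum(1 for x, label in samples if model(x) == label)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x, _label in samples:
            model(x)
        timings.append(time.perf_counter() - start)
    return {"accuracy": correct / len(samples), "latency_s": min(timings)}

def beats_baseline(candidate, baseline):
    """True when a candidate is at least as accurate and at least as fast."""
    return (candidate["accuracy"] >= baseline["accuracy"]
            and candidate["latency_s"] <= baseline["latency_s"])

def toy_model(x):
    return int(x > 0.5)

labeled_samples = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.55, 0)]
baseline = run_benchmark(toy_model, labeled_samples)
```

Storing the baseline once and comparing every later candidate against it mirrors the role of the stored benchmark results in the iterations that follow.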
The quantization controller circuitry ID7_102 executes the example process to parse a model to identify operations (block ID7_308). The parse model to identify operations (block ID7_308) is an example of a process that automatically receives its input when it retrieves a model (block ID7_302) in addition to receiving all the operation configurations (e.g., MatMul, GatherNd). After parsing the model to identify operations (block ID7_308) the quantization controller circuitry ID7_102 automatically sends the results (e.g., MatMul, GatherNd) to the determine/create search space (block ID7_310) and the initialize hyperparams (block ID7_320). This is performed by the environment quantizer circuitry ID7_106. The parse model to identify operations (block ID7_308) provides the operation configuration, which groups models based on similar operation configurations and chooses similar quantization weights. After parsing the model to identify operations (block ID7_308) the quantization controller circuitry ID7_102 sends its output to initialize hyperparams (block ID7_320) described herein below.
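Parsing a model to identify operations might look like the following minimal sketch, which assumes the model is available as a list of nodes that each name an operation type. The graph format, the node names, and the set of quantizable operation types are assumptions made for illustration.

```python
# Minimal sketch of identifying quantizable operations in a model graph.
# The graph representation and the QUANTIZABLE_OPS set are assumptions.

QUANTIZABLE_OPS = {"MatMul", "GatherNd", "Conv2D"}

def identify_operations(graph):
    """Group the names of quantizable nodes by operation type."""
    found = {}
    for node in graph:
        if node["op"] in QUANTIZABLE_OPS:
            found.setdefault(node["op"], []).append(node["name"])
    return found

model_graph = [
    {"name": "embed_lookup", "op": "GatherNd"},
    {"name": "attn_scores", "op": "MatMul"},
    {"name": "softmax", "op": "Softmax"},
    {"name": "attn_output", "op": "MatMul"},
]
operations = identify_operations(model_graph)
```

Grouping by operation type gives a natural key for reusing quantization choices across models with similar operation configurations.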
The quantization controller circuitry ID7_102 executes the example process to initialize hyperparams (block ID7_320). The initialize hyperparams (block ID7_320) is an example of a process that automatically receives the parsed model that identifies operations (block ID7_308) and initializes the quantization parameters that were parsed. The quantization parameters are sent to the environment quantizer circuitry ID7_106. This provides the environment quantizer circuitry ID7_106 with the initialized hyperparams with which to quantize the selected operation(s) when the quantization parameters are sent to quantize selected operation(s) (block ID7_326), described herein.
The example parse model to identify operations (block ID7_308) also sends its output parsed model to an example determine/create search space (block ID7_310). The quantization controller circuitry ID7_102 executes the example process to determine/create search space (block ID7_310). The determine/create search space (block ID7_310) is an example of a process that creates and/or otherwise builds a search space using the results from the example parse model to identify operations (block ID7_308) that were first calculated based on the example model retrieved initially (block ID7_302). The output of the determine/create search space (block ID7_310) is sent to a select operation combination (block ID7_312). At least one benefit of the determine/create search space (block ID7_310) is that human discretion is removed from the analysis process. Generally speaking, traditional techniques for determining how to quantize a model required human input and decisions regarding which layers/elements to quantize. However, such human input is subject to variation and error regarding candidate search spaces to be considered. Furthermore, a full and thorough exploration of all possible permutations of the search space is unrealistic if done by manual human efforts. Examples disclosed herein use agent exploration efforts to identify candidate search spaces and generate resultant outputs. Such outputs are further examined for their corresponding effect on performance metrics such that the best quantization choices are implemented in one or more models.
The quantization controller circuitry ID7_102 executes the example process to select an operation combination (block ID7_312). The select operation combination (block ID7_312) is an example of a process that selects and sends the search space of quantizable operations built in the determine/create search space (block ID7_310) to the example generate quantization topology (block ID7_314). In some examples, the select operation combination (block ID7_312) chooses a random operation combination during a first iteration and, in further iterations, refines the operation combination to one that performs better than the baseline determined when the benchmark test was established (block ID7_304). In some examples, the select operation combination (block ID7_312) uses grouping to decide which operation combination to select. Using grouping involves choosing an operation combination based on the results of other models that had similar operations (e.g., MatMul, GatherNd) and precision (e.g., int8, int16, bf16, etc.).
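The reason manual exploration becomes unrealistic can be seen in a minimal sketch that enumerates one precision per quantizable operation. The operation names and the precision list are hypothetical; the enumeration itself is only one assumed way of representing a search space.

```python
# Minimal sketch of building a search space of per-operation precisions.
# Operation names and the precision list are illustrative assumptions.
from itertools import product

def build_search_space(operations, precisions=("fp32", "bf16", "int8")):
    """Enumerate every assignment of one precision to each operation."""
    return [dict(zip(operations, combo))
            for combo in product(precisions, repeat=len(operations))]

ops = ["attn_scores", "attn_output", "embed_lookup"]
space = build_search_space(ops)
print(len(space))   # 3 precisions over 3 operations gives 27 candidates
```

The number of candidates grows as the number of precisions raised to the number of operations, so a model with dozens of quantizable operations cannot be enumerated exhaustively, which motivates an automated search strategy.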
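The selection behavior described above (a random combination first, then refinement) can be sketched as follows. Refinement is approximated here by choosing the untried combination closest to the best one recorded so far, which is an assumption for illustration and not necessarily the technique used by the disclosed examples; all names are hypothetical.

```python
# Minimal sketch of selecting an operation combination; the nearest-neighbor
# refinement rule is an illustrative assumption.
import random

def distance(a, b):
    """Number of operations whose precision differs between two combinations."""
    return sum(1 for op in a if a[op] != b[op])

def select_combination(search_space, history, rng):
    """Pick randomly when no results exist yet; otherwise pick the untried
    combination closest to the best-scoring one recorded in `history`."""
    tried = [combo for combo, _score in history]
    untried = [c for c in search_space if c not in tried]
    if not untried:
        return None
    if not history:
        return rng.choice(untried)
    best, _score = max(history, key=lambda item: item[1])
    return min(untried, key=lambda c: distance(c, best))

space = [{"MatMul": p, "GatherNd": q}
         for p in ("fp32", "int8") for q in ("fp32", "int8")]
rng = random.Random(0)
first = select_combination(space, [], rng)
history = [({"MatMul": "int8", "GatherNd": "fp32"}, 0.7),
           ({"MatMul": "fp32", "GatherNd": "fp32"}, 0.4)]
second = select_combination(space, history, rng)
```

A grouping-based variant could seed `history` with results recorded for other models that share the same operation types and precisions.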
As described above, the example quantization controller circuitry ID7_102 illustrated in FIG. ID7_1 receives, retrieves and/or otherwise obtains one or more models (block ID7_302), and establishes and/or otherwise initiates a benchmark test (block ID7_304). Additionally, the example quantization controller circuitry ID7_102 parses the retrieved one or more models to identify operations (block ID7_308), and determines/creates a search space based on, for example, particular operations that are capable of being quantized (block ID7_310). The example quantized topology creator circuitry ID7_104 selects operation combinations (block ID7_312), and during respective iterations, the example quantization controller circuitry ID7_102 obtains the current state s_t from the previous iteration and the performance metrics, such as accuracy, size, and latency, for the current state s_t. Using these inputs, the example quantization controller circuitry ID7_102 determines what operations are quantizable and how they should be quantized (e.g., grouping). The example quantization controller circuitry ID7_102 determines this through various convergence strategies including, but not limited to, reinforcement learning (e.g., policy gradient), evolutionary algorithms (e.g., genetic algorithms), and heuristic search (e.g., branch & bound). During the first few iterations, the example quantization controller circuitry ID7_102 generates random quantization topologies so that it is able to determine observations and rewards, and then begins training with the quantization topologies having the best scored observations and rewards. The output of the example quantization controller circuitry ID7_102 is an action that equals the next state s_(t+1), which is arranged as a quantization topology for input to the environment quantizer circuitry ID7_106. 
The example quantization controller circuitry ID7_102 generates a quantization topology (block ID7_316) and sends the output to the environment quantizer circuitry ID7_106. The example quantization controller circuitry ID7_102 receives the result of the environment quantizer circuitry ID7_106 and updates the search strategy based on comparison (block ID7_318). The example quantization controller circuitry ID7_102, based on the results of the updated search strategy (block ID7_318), decides whether the iterations should end or continue (block ID7_317). If the result of the updated search strategy (block ID7_318) reaches a criterion (e.g., significantly improves over the benchmark test, reaches a user designated optimal solution, etc.), the decision block ID7_317 can end the loop and output the resulting quantization and quantization weights. If the result of the updated search strategy (block ID7_318) does not reach the criterion, the decision block ID7_317 sends the updated search strategy (block ID7_318) to the generate quantization topology (block ID7_316) so a new iteration can begin.
The quantized topology creator circuitry ID7_104 within the quantization controller circuitry ID7_102 generates a quantization topology (block ID7_316) (e.g., based on inputs to the example quantization controller circuitry ID7_102). In some examples, the generated topology (block ID7_316) starts with a random topology if there is no prior data. Subsequent topologies are generated with guidance from the process where the search strategy is updated based on comparison (block ID7_318) (e.g., agent). Once observations and rewards are recorded, the quantized topology creator circuitry ID7_104 generates a quantization topology (block ID7_316) based on optimal quantizations (e.g., quantization weights with relatively or comparatively high rewards recorded). In some examples, the quantized topology creator circuitry ID7_104 generates quantization topologies based on grouping, where models with similar operations and precisions are quantized together or with similar weights. After the quantized topology creator circuitry ID7_104 generates the quantization topology (block ID7_316), the quantized topology creator circuitry ID7_104 sends its output to cause quantization of selected operations (block ID7_326), which is discussed in further detail below and forms a part of the example environment quantizer circuitry ID7_106.
As described above, the example environment quantizer circuitry ID7_106 carries out quantization tasks, benchmarking tasks, and/or testing as described in FIG. ID7_1. The example environment quantizer circuitry ID7_106 receives inputs from the example quantization controller circuitry ID7_102 to generate quantization topologies (block ID7_316). The example environment quantizer circuitry ID7_106 quantizes selected operation(s) (block ID7_326), compares results to the benchmark (block ID7_328), and analyzes the performance of the quantization topology (block ID7_322). The performance results are sent (on an iterative basis) to the example quantization controller circuitry ID7_102 to update search strategies based on comparisons (block ID7_318).
In some examples, the environment quantizer circuitry ID7_106 quantizes selected operation(s) based on any number of hyperparameters (block ID7_320). Using the selected operations (e.g., MatMul, GatherNd) and the generated quantization topology, the example environment quantizer circuitry ID7_106 quantizes the selected operation(s) (block ID7_326) in view of such hyperparameters. This yields a weighted quantization along with resulting metrics including accuracy, latency, and/or size.
The environment quantizer circuitry ID7_106 compares results to the benchmark (block ID7_328). The example environment quantizer circuitry ID7_106 compares results to benchmark metrics (block ID7_328) to determine benchmark metrics (e.g., throughput, accuracy, size, etc.), and the example reward assigner circuitry ID7_108 determines whether such metrics are an improvement over benchmark values evaluated in one or more prior iterations (block ID7_322).
The example reward assigner circuitry ID7_108 sends performance results to the example quantization controller circuitry ID7_102 to update a search strategy (e.g., of the current iteration) based on comparison (block ID7_318). An example of assigning rewards includes but is not limited to adding a reward value to an action or metric (e.g., an action or metric that the user predefined). An example of an action or metric that is assigned (e.g., by the user) is accuracy, latency, and/or size of a model. In some examples observing higher accuracy results in an incremental increase in weighted value. If higher accuracy is observed, then a relatively greater reward is assigned to the action or metric. In some examples observing a lower latency results in an incremental increase in weighted value. If lower latency is observed, then a relatively greater reward is assigned to the action or metric. In some examples, a smaller size of a model results in an incremental increase in weighted value. If a smaller size of a model is observed, then a relatively greater reward is assigned to the action or metric. The quantization controller circuitry ID7_102 executes the example process to update a current iteration of the search strategy based on comparison (block ID7_318). In some examples, updating the search strategy based on comparison(s) (block ID7_318) uses rewards and performance metrics to make decisions (e.g., make decision in block ID7_317 if the iterations should end or continue). Rewards and performance metrics from the example environment quantizer circuitry ID7_106 for corresponding actions are evaluated and sent to the update search strategy based on comparison (block ID7_318). Some examples of metrics for rewards include, but are not limited to accuracy, latency, and size. In some examples, accuracy is determined by quality metrics corresponding to a model with a relatively highest reward value. 
In some examples, latency is determined by assigning relatively higher rewards (e.g., relatively higher weight values) corresponding to relatively faster models. In some examples, size is determined by one or more reduced ratio metrics corresponding to an amount of memory and/or storage resources consumed by the model. In some examples, additional metrics such as memory bandwidth or power consumption are added from a hardware monitoring perspective and are added for better evaluation.
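The reward assignment described above can be illustrated with a short sketch. The source does not give a reward formula, so the weighted sum of relative improvements used here, along with the names `Metrics` and `assign_reward` and the default weights, are assumptions chosen only to show how higher accuracy, lower latency, and smaller size can each raise the reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float    # fraction in (0, 1]; higher is better
    latency_ms: float  # lower is better
    size_mb: float     # lower is better


def assign_reward(candidate: Metrics, baseline: Metrics,
                  w_acc: float = 1.0, w_lat: float = 1.0,
                  w_size: float = 1.0) -> float:
    """Hypothetical reward: weighted sum of relative gains over the baseline."""
    acc_gain = (candidate.accuracy - baseline.accuracy) / baseline.accuracy
    lat_gain = (baseline.latency_ms - candidate.latency_ms) / baseline.latency_ms
    size_gain = (baseline.size_mb - candidate.size_mb) / baseline.size_mb
    return w_acc * acc_gain + w_lat * lat_gain + w_size * size_gain


baseline = Metrics(accuracy=0.80, latency_ms=10.0, size_mb=100.0)
faster_smaller = Metrics(accuracy=0.80, latency_ms=5.0, size_mb=25.0)
slower_larger = Metrics(accuracy=0.80, latency_ms=20.0, size_mb=100.0)

reward_good = assign_reward(faster_smaller, baseline)
reward_bad = assign_reward(slower_larger, baseline)
```

In this sketch a candidate that matches the baseline on every metric receives a reward of zero, so a positive reward indicates an improvement over the benchmark. Additional hardware metrics such as memory bandwidth or power consumption could be added as further weighted terms.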
In response to the quantization controller circuitry ID7_102 completing updating the search strategy based on a comparison (block ID7_318), the quantization controller circuitry ID7_102 then determines whether the current iteration of the quantization strategy, current rewards and/or current performance results achieve a quantization topology that is improved (e.g., achieves an optimal solution (e.g., predefined optimal metrics) corresponding to performance thresholds) (block ID7_317). In some examples, the quantization controller circuitry ID7_102 determines that iterations should end based on a particular quantity of attempted iterations and/or epochs (block ID7_317). Otherwise, if the update search strategy based on comparison results in a search strategy that does not achieve an optimal solution or a particular quantity of iterations (block ID7_317), then the process advances to generate (another iteration of) the quantization topology (block ID7_316), where it will then go through the cycle of quantizing selected operations (block ID7_326), comparing results to the benchmark (block ID7_328), and analyzing performance (block ID7_322) to update the search strategy based on comparison (block ID7_318).
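The iterative cycle described above (generate a topology, quantize, compare, update, decide whether to stop) can be sketched as a simple search loop. This is a minimal illustration under stated assumptions, not the disclosed implementation: `evaluate` is a toy stand-in for the environment quantizer, its scoring rule is invented, and the update step is a basic mutate-and-keep-if-better heuristic rather than reinforcement learning.

```python
import random

PRECISIONS = ("fp32", "int8", "bf16")  # "fp32" here means "leave unquantized"


def evaluate(topology):
    """Toy stand-in for the environment quantizer; returns a reward.

    Assumption for illustration only: int8 earns 2 points and bf16 earns 1,
    except that the operation named "GatherNd" loses 3 points under int8.
    """
    score = 0
    for op, precision in topology.items():
        if precision == "int8":
            score += -3 if op == "GatherNd" else 2
        elif precision == "bf16":
            score += 1
    return score


def search(ops, baseline_reward, target_reward, max_iterations, seed=0):
    rng = random.Random(seed)
    # First iteration: no prior data, so start from a random topology.
    best = {op: rng.choice(PRECISIONS) for op in ops}
    best_reward = evaluate(best)
    for _ in range(max_iterations):
        if best_reward >= target_reward:  # stop criterion reached
            break
        # Generate the next topology by changing one operation's precision.
        candidate = dict(best)
        candidate[rng.choice(ops)] = rng.choice(PRECISIONS)
        reward = evaluate(candidate)
        if reward > best_reward:  # update the strategy based on comparison
            best, best_reward = candidate, reward
    return best, best_reward, best_reward > baseline_reward


best, best_reward, improved = search(
    ["MatMul_0", "MatMul_1", "GatherNd"],
    baseline_reward=0, target_reward=5, max_iterations=500)
```

The loop ends either when the target reward is reached or when the iteration budget is exhausted, mirroring the two termination conditions of decision block ID7_317.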
FIG. ID7_4 is an example of candidate action spaces. In the illustrated example of FIG. ID7_4, different candidate action spaces are selected for a quantization topology. The example quantization type corresponds to the type of quantization used in the iteration such as, but not limited to INT8, BF16, and FP16.
FIG. ID7_5 illustrates an example process ID7_500 of how a pre-quantized model ID7_502 is analyzed by examples disclosed herein to generate a quantized model ID7_506. In the illustrated example of FIG. ID7_5, the pre-quantized model ID7_502 is an example of a model that is input into the quantization controller circuitry ID7_102, which takes the output of the quantized topology generator circuitry ID7_104 and quantizes the corresponding operations in the model. Example environment quantizer circuitry ID7_106 applies a particular quantization type ID7_504, such as INT8 or BF16, to the pre-quantized model ID7_502. As a result of a particular selection of a quantization type ID7_504, the example quantized model ID7_506 is augmented accordingly. In some examples, if one or more operations are adjacent, then they are automatically grouped to form a larger quantize-dequantize block to increase speed and performance. In the illustrated example of FIG. ID7_5, a first grouped block ID7_508 is an example of two operations that are quantized together because they have a similar operation (e.g., MatMul) and a similar precision (e.g., 1 (INT8)). Generally speaking, the example search space and strategy solver circuitry ID7_110 analyzes contents of the model of interest for particular operations and/or particular precisions. During any number of iterations, the search space and strategy solver circuitry ID7_110 creates and/or otherwise assigns different groups of operation(s) and corresponding precision values that, when performed in a particular grouping (e.g., two adjacent operations quantized together at a first precision, three adjacent operations quantized together at a second precision, etc.), result in different performance metrics (e.g., relatively faster quantization duration, relatively greater accuracy results, etc.). In some examples more permutations are searched and found or disregarded if they do not result in an optimal and/or otherwise improved solution. 
In some examples, the operation(s) are different (e.g., MatMul vs. GatherNd), which leads to not quantizing the models together in a group. In some examples, the precision differs, which leads to not quantizing models together in a group (e.g., 2 (BF16) and 1 (INT8)). Examples disclosed herein optionally expose one or more knobs to facilitate selection and/or adjustment of selectable options. Knobs may be selected by, for example, a user and/or an agent. Agent knob adjustment may occur in an automatic manner independent of the user in an effort to identify one or more particular optimized knob settings. In some examples, the example strategy solver circuitry ID7_110 stores identified groups of operations that cause an improvement in a metric of interest, such as efficiency, size, latency, etc. Such groups of operations may be stored in a database for future reference, such as during a runtime effort where operations can be looked up.
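The grouping rule described above, in which adjacent operations are merged into one quantize-dequantize block only when both the operation type and the precision match, can be sketched as follows. The function name `group_adjacent` and the representation of a topology as a list of `(operation, precision)` pairs are assumptions made for this illustration.

```python
from itertools import groupby


def group_adjacent(ops):
    """Merge consecutive (op_type, precision) pairs that match exactly.

    Returns a list of ((op_type, precision), run_length) tuples, one per
    block, in the original order of the operations.
    """
    return [(key, len(list(run))) for key, run in groupby(ops)]


topology = [
    ("MatMul", "int8"),
    ("MatMul", "int8"),    # same operation and precision: joins the block
    ("GatherNd", "int8"),  # different operation: starts a new block
    ("MatMul", "bf16"),    # different precision: starts a new block
    ("MatMul", "int8"),    # not adjacent to the first block: stays separate
]
blocks = group_adjacent(topology)
```

Groups found to improve a metric of interest could then be stored (e.g., keyed by operation type and precision) for lookup at runtime.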
FIG. ID7_6 is a flowchart representative of an example process 600 of providing quantization weights to deep learning models. This is another example representation of the structure diagram in FIG. ID7_1 and the process diagrams in FIG. ID7_2 and FIG. ID7_3. The example process 600 includes receiving or obtaining a base neural network (block 604). A plurality of candidate operations is generated by the example quantization controller circuitry ID7_102 and is quantized from the base neural network (block 606). The candidate operations are then quantized by the example environment quantizer circuitry ID7_106 after generating a search space and selecting a strategy solver (block 608). Using the quantized candidate operations, a candidate quantized topology is generated (block 610) by the example quantized topology generator circuitry ID7_104. To quantize the candidate operations based on the candidate quantized topology, an environment is generated (block 612) by the example environment quantizer circuitry ID7_106. The base neural network is then tested to determine benchmark metrics corresponding to performance of an underlying hardware platform (block 614) by the quantization controller circuitry ID7_102. Based on observations about the base neural network using metrics such as accuracy and latency, rewards are assigned (block 616) by the example reward assigner circuitry ID7_108. Based on performance with the relatively highest reward values, an optimal solution is converged on (block 618) by the example quantization controller circuitry ID7_102.
From the foregoing, it will be appreciated that example methods, apparatus, systems and articles of manufacture have been disclosed that improve quantization techniques of models. Disclosed methods, apparatus, systems and articles of manufacture improve the efficiency of using a computing device by creating a framework that automates the process of generating and providing optimal quantization weights to deep learning models. Furthermore, examples disclosed herein remove human discretion from the analysis process, which creates a faster and more efficient system and corresponding models that is/are not subject to variation and errors due to human involvement. Additionally, quantized models (e.g., the example quantized model ID7_506 of FIG. ID7_5) are automatically grouped to form a larger quantize-dequantize block to increase speed and performance. Disclosed methods, apparatus, systems and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.
Further variations of examples disclosed herein are provided by the following examples.
Example methods, apparatus, systems, and articles of manufacture to optimize resources in Edge networks are disclosed herein. Further examples and combinations thereof include the following:
Example 119 includes an apparatus comprising at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus, a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations, or Application Specific Integrated Circuitry including logic gate circuitry to perform one or more third operations, the processor circuitry to at least one of perform at least one of the first operations, the second operations or the third operations to generate baseline metrics corresponding to a model, identify operations corresponding to the model, generate a search space corresponding to the model, the search space including respective ones of the operations that can be quantized, generate a first quantization topology, the first quantization topology corresponding to a first search strategy, perform quantization on the first quantization topology, and compare quantization results of the first quantization topology to the baseline metrics, the quantization controller to update the first search strategy to a second search strategy during a second iteration, the second search strategy corresponding to an updated version of the model, the updated version of the model having improved metrics compared to the baseline metrics.
Example 120 includes the apparatus as defined in example 119, wherein the processor circuitry is to calculate the first factors of the first quantization topology, the first factors including adjacency information corresponding to the operations that can be quantized.
Example 121 includes the apparatus as defined in example 119, wherein the processor circuitry is to identify operation groups corresponding to the improved metrics.
Example 122 includes the apparatus as defined in example 119, wherein the processor circuitry is to perform a search strategy that includes at least one of reinforcement learning, evolutionary algorithms, or heuristic search.
Example 123 includes the apparatus as defined in example 119, wherein the processor circuitry is to calculate model metrics including at least one of throughput, accuracy, latency, or size.
Example 124 includes the apparatus as defined in example 119, wherein the processor circuitry is to calculate metrics for hardware including at least one of memory bandwidth, power consumption, or speed.
Example 125 includes the apparatus as defined in example 119, wherein the processor circuitry is to store operations that have been identified as candidate groups to improve efficiency.
Example 126 includes the apparatus as defined in example 119, wherein the processor circuitry is to retrieve one or more groups of operations from a storage device, the groups to be quantized together.
Example 127 includes at least one non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least generate baseline metrics corresponding to a model, identify operations corresponding to the model, generate a search space corresponding to the model, the search space including respective ones of the operations that can be quantized, generate a first quantization topology, the first quantization topology corresponding to a first search strategy, perform quantization on the first quantization topology, and compare quantization results of the first quantization topology to the baseline metrics, the quantization controller to update the first search strategy to a second search strategy during a second iteration, the second search strategy corresponding to an updated version of the model, the updated version of the model having improved metrics compared to the baseline metrics.
Example 128 includes the at least one computer readable storage medium as defined in example 127, wherein the instructions, when executed, cause the at least one processor to calculate the first factors of the first quantization topology, the first factors including adjacency information corresponding to the operations that can be quantized.
Example 129 includes the at least one computer readable storage medium as defined in example 127, wherein the instructions, when executed, cause the at least one processor to identify operation groups corresponding to the improved metrics.
Example 130 includes the at least one computer readable storage medium as defined in example 127, wherein the instructions, when executed, cause the at least one processor to perform a search strategy that includes at least one of reinforcement learning, evolutionary algorithms, or heuristic search.
Example 131 includes the at least one computer readable storage medium as defined in example 127, wherein the instructions, when executed, cause the at least one processor to calculate model metrics including at least one of throughput, accuracy, latency, or size.
Example 132 includes the at least one computer readable storage medium as defined in example 127, wherein the instructions, when executed, cause the at least one processor to calculate metrics for hardware including at least one of memory bandwidth, power consumption, or speed.
Example 133 includes a method comprising generating baseline metrics corresponding to a model, identifying operations corresponding to the model, generating a search space corresponding to the model, the search space including respective ones of the operations that can be quantized, generating a first quantization topology, the first quantization topology corresponding to a first search strategy, performing quantization on the first quantization topology, and comparing quantization results of the first quantization topology to the baseline metrics, the quantization controller to update the first search strategy to a second search strategy during a second iteration, the second search strategy corresponding to an updated version of the model, the updated version of the model having improved metrics compared to the baseline metrics.
Example 134 includes the method as defined in example 133, further including calculating the first factors of the first quantization topology, the first factors including adjacency information corresponding to the operations that can be quantized.
Example 135 includes the method as defined in example 133, further including identifying operation groups corresponding to the improved metrics.
Example 136 includes the method as defined in example 133, further including performing a search strategy that includes at least one of reinforcement learning, evolutionary algorithms, or heuristic search.
Example 137 includes the method as defined in example 133, further including calculating model metrics including at least one of throughput, accuracy, latency, or size.
Example 138 includes the method as defined in example 133, further including calculating metrics for hardware including at least one of memory bandwidth, power consumption, or speed.
Example 139 includes an apparatus to optimize a model, comprising a quantization controller to, during an initial iteration generate baseline metrics corresponding to a model, identify operations corresponding to the model, generate a search space corresponding to the model, the search space including respective ones of the operations that can be quantized, and generate a first quantization topology, the first quantization topology corresponding to a first search strategy, an environment quantizer to perform quantization on the first quantization topology, and compare quantization results of the first quantization topology to the baseline metrics, the quantization controller to update the first search strategy to a second search strategy during a second iteration, the second search strategy corresponding to an updated version of the model, the updated version of the model having improved metrics compared to the baseline metrics.
Example 140 includes the apparatus as defined in example 139, wherein the quantization controller is to calculate the first factors of the first quantization topology, the first factors including adjacency information corresponding to the operations that can be quantized.
Example 141 includes the apparatus as defined in example 139, wherein the environment quantizer is to identify operation groups corresponding to the improved metrics.
Example 142 includes the apparatus as defined in example 139, wherein the environment quantizer is to perform a search strategy that includes at least one of reinforcement learning, evolutionary algorithms, or heuristic search.
Example 143 includes the apparatus as defined in example 139, wherein the environment quantizer is to calculate model metrics including at least one of throughput, accuracy, latency, or size.
Example 144 includes the apparatus as defined in example 139, wherein the environment quantizer is to calculate metrics for hardware including at least one of memory bandwidth, power consumption, or speed.
Examples disclosed herein are consistent with International Publication No. WO/2019/197855 (International Application No. PCT/IB2018/000513) filed on Apr. 9, 2018. International Publication No. WO/2019/197855 is incorporated by reference herein in its entirety.
Convolutional neural networks (CNNs) may be used in computer vision applications to support various tasks (e.g., object detection). The relatively large number of parameters and high computational cost of such networks, however, may render them difficult to use in power-constrained “Edge” devices such as smart cameras.
Conventional attempts to reduce the number of parameters and/or the complexity of CNNs may identify redundancies in the network during training and statically remove the redundancies to obtain a final network configuration. Such an approach may result in lower accuracy depending on the image context encountered after deployment of the network.
Turning now to FIG. ID11_A, a portion of a neural network ID11_A100 is shown including an example first branch implementation controller ID11_A101. The portion of the neural network ID11_A100 includes a second network layer ID11_A102 (ID11_A102a, ID11_A102b, e.g., convolutional, rectified linear unit/ReLU, pooling, fully connected (FC) layer, etc.) coupled to an output of a first network layer ID11_A104 (e.g., convolutional, ReLU, pooling, FC layer, etc.). In one example, the input to the first network layer ID11_A104 holds raw pixel values of an image, where the first network layer ID11_A104 is a convolutional layer that extracts features (e.g., edges, curves, colors) from the image. The result may be an activation map ID11_A106 that indicates which regions of the image are likely to contain the features that the first network layer ID11_A104 is configured to extract. Configuring the first network layer ID11_A104 to extract certain features may be done during a training procedure in which known input images are fed to a neural network including the portion of the neural network ID11_A100, and filter weights of the first network layer ID11_A104 are adjusted to achieve a targeted result. Because the convolution process may involve a relatively high number of multiplication operations (e.g., dot product calculations between image pixel values and filter weights), the first network layer ID11_A104 may represent a correspondingly large portion of the computational cost/expense of the neural network including the portion of the neural network ID11_A100. Similarly, the second network layer ID11_A102 might have a high computational cost.
For example, the computational complexity of a convolution layer may be determined by:
Num_of_input_channels×kernel_width×kernel_height×Num_of_output_channels
Although the ability to change the kernel size of the convolution operation may be limited, Num_of_input_channels and/or Num_of_output_channels may be manipulated to decrease computations during inferences.
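The effect of reducing channel counts on the expression above can be checked with a short calculation. The function name `conv_complexity` and the example layer dimensions (256 channels, 3×3 kernel) are assumptions for illustration; the product is the one given in the text and does not account for the spatial size of the feature map.

```python
def conv_complexity(in_channels, kernel_width, kernel_height, out_channels):
    """Evaluate the complexity expression given in the text."""
    return in_channels * kernel_width * kernel_height * out_channels


# Hypothetical layer: 256 input channels, 3x3 kernel, 256 output channels.
full = conv_complexity(256, 3, 3, 256)

# Pruning half of the input channels halves the cost, kernel size unchanged.
pruned = conv_complexity(128, 3, 3, 256)

ratio = pruned / full
```

Because the expression is linear in each channel count, pruning a given fraction of input (or output) channels reduces the computation of that layer by the same fraction.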
As will be discussed in greater detail, the first branch implementation controller ID11_A101 includes and/or implements a lightweight branch path ID11_A108 located (e.g., positioned and/or connected) between the first network layer ID11_A104 and the second network layer ID11_A102. The branch path ID11_A108 may be used to prune unimportant channels (e.g., red channel, green channel, blue channel) from the activation map ID11_A106. More particularly, the branch path ID11_A108 and/or, more generally, the first branch implementation controller ID11_A101 may include a context aggregation component ID11_A110 that aggregates context information from the first network layer ID11_A104. In one example, the context information includes channel values (e.g., red channel values, green channel values, blue channel values) associated with the first network layer ID11_A104. Moreover, the context aggregation component ID11_A110 may be a downsample (e.g., pooling) layer that averages channel values in the first network layer ID11_A104. Additionally, the illustrated branch path ID11_A108 includes a plurality of FC layers ID11_A112 (ID11_A112a, ID11_A112b), implemented and/or executed by the example first branch implementation controller ID11_A101, that conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer ID11_A104 from consideration by the second network layer ID11_A102 based on the importance classification. The FC layers may generally function as memory that documents/memorizes various input data that is fed to the network during training. In the illustrated example, an unimportant channel portion ID11_A102b of the second network layer ID11_A102 is excluded from consideration and an important channel portion ID11_A102a is not excluded from consideration. As a result, the smaller second network layer ID11_A102 may facilitate faster inferences without incurring a loss of accuracy.
Thus, if the first network layer ID11_A104 has 256 output neurons, the context aggregation component ID11_A110 might provide a “blob” of 256 values to a first FC layer ID11_A112a, where the first FC layer ID11_A112a generates a high-level feature vector having ID11_A114 elements/output neurons (e.g., with the value of each output neuron indicating the likelihood of that neuron being activated). Additionally, a second FC layer ID11_A112b may generate an importance score vector based on the high-level feature vector, where the importance score vector has 256 output neurons. The second FC layer ID11_A112b may generally make higher level classifications than the first FC layer ID11_A112a. The importance score vector may contain zero values for neurons in less important channels. Accordingly, passing the activation map ID11_A106 and the importance score vector through a multiplier ID11_A114 may selectively exclude all neurons in the less important channels. The FC layers ID11_A112 may be considered “fully connected” to the extent that every neuron in the previous layer is connected to every neuron in the next layer.
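The data flow through the branch path (context aggregation, two FC layers, and a multiplier) can be sketched in plain Python on a tiny example. All function names, layer sizes, and weight values below are hypothetical, and the ReLU applied to the importance scores is an assumption used here to produce the zero values for less important channels; real weights would be learned during training.

```python
def global_average_pool(activation_map):
    """Context aggregation: average the values of each channel."""
    return [sum(channel) / len(channel) for channel in activation_map]


def fully_connected(inputs, weights, biases):
    """One FC layer: every input feeds every output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def prune_channels(activation_map, w1, b1, w2, b2):
    context = global_average_pool(activation_map)
    features = relu(fully_connected(context, w1, b1))     # high-level features
    importance = relu(fully_connected(features, w2, b2))  # one score per channel
    # Multiplier: a zero importance score removes the whole channel.
    pruned = [[score * v for v in channel]
              for channel, score in zip(activation_map, importance)]
    return importance, pruned


# Three channels with two (flattened) spatial positions each.
activation_map = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]]
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # 3 inputs -> 2 features
b1 = [0.0, 0.0]
w2 = [[0.5, 0.0], [0.0, -1.0], [0.25, 0.25]]  # 2 features -> 3 scores
b2 = [0.0, 0.0, 0.0]

importance, pruned = prune_channels(activation_map, w1, b1, w2, b2)
```

With these illustrative weights the second channel receives an importance score of zero, so every value in that channel is zeroed while the other two channels pass through unchanged. Because the scores are computed from the current input, the set of pruned channels can differ from one image to the next.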
Of particular note is that the context aggregation component ID11_A110, implemented by the example first branch implementation controller ID11_A101, aggregates the context information in real-time (e.g., on-the-fly) and after the training of the neural network. Accordingly, accuracy may be increased while accelerating inferences, regardless of the image context encountered after deployment of the neural network. For example, if the neural network is deployed in an application that processes images lacking features that were present in the images used in training, the illustrated pruning approach is able to reduce processing time by eliminating the channels configured to extract the missing features. Moreover, the technology described herein may facilitate the discard of some insignificant features that may otherwise prevent the network from making an accurate decision. Accordingly, the branch path ID11_A108 may be considered a regularization technique. As will be discussed in greater detail, the post-training pruning may use either a fixed pruning ratio constraint or an “adversarial” balance between a layer width loss and an accuracy constraint.
FIG. ID11_B illustrates an example second branch implementation controller ID11_B101 to implement a branch path ID11_B102 that uses a fixed pruning ratio constraint to accelerate neural network inferences. In some examples, the portion of the neural network ID11_A100 of FIG. ID11_A includes the second branch implementation controller ID11_B101 in place of the first branch implementation controller ID11_A101. The fixed pruning ratio constraint may generally be a percentage of channels to be pruned. In the illustrated example, the second branch implementation controller ID11_B101 implements a first FC layer ID11_B104 that is coupled to a ReLU ID11_B106 (“ReLU1”) that introduces nonlinearity (e.g., clipping activation by a threshold of one) into the output (e.g., probability vector having ID11_A14 output neurons) of the first FC layer ID11_B104. The example second branch implementation controller ID11_B101 implements a second FC layer ID11_B108 that may be coupled to the ReLU ID11_B106, where the output (e.g., probability vector having 256 output neurons) of the second FC layer ID11_B108 may be processed by an adaptive bias component ID11_B110 (e.g., layer). The example second branch implementation controller ID11_B101 implements the adaptive bias component ID11_B110. The adaptive bias component ID11_B110 may calculate a bias that controls the ratio between positive and negative values in the probability vector from the second FC layer ID11_B108, where the ratio may be set based on a fixed pruning ratio constraint (e.g., 80% important, 20% unimportant). Additionally, a threshold layer ID11_B112, implemented by the example second branch implementation controller ID11_B101, may set (e.g., truncate) all negative values in the output of the adaptive bias component ID11_B110 to zero and set all positive values in the output of the adaptive bias component ID11_B110 to one.
Thus, passing an activation map (not shown) and the output of the threshold layer ID11_B112 through a multiplier ID11_B114 may selectively exclude all neurons in the less important channels, with importance being enforced via the pruning ratio constraint.
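The fixed pruning ratio branch described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names, the midpoint rule used to choose the bias, and the sample scores are assumptions introduced here, and the outputs of the FC layers are taken as given.

```python
def adaptive_bias(scores, keep_ratio):
    """Return a bias so that roughly keep_ratio of (score + bias) values are positive.

    Assumption: the zero crossing is placed midway between the last kept
    score and the first dropped score. Tied scores at the cut may cause
    fewer channels to be kept.
    """
    n_keep = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(scores, reverse=True)
    if n_keep >= len(ranked):
        return 1.0 - min(ranked)  # every shifted score becomes positive
    cut = (ranked[n_keep - 1] + ranked[n_keep]) / 2.0
    return -cut


def threshold(values):
    """Set positive values to one and all other values to zero."""
    return [1.0 if v > 0 else 0.0 for v in values]


def gate_channels(activation_map, mask):
    """Multiply every neuron of a channel by that channel's mask value."""
    return [[m * v for v in channel] for channel, m in zip(activation_map, mask)]


# Hypothetical output of the second FC layer for five channels.
scores = [0.9, -0.3, 0.4, 0.1, -0.8]
bias = adaptive_bias(scores, keep_ratio=0.8)  # e.g., 80% important
mask = threshold([s + bias for s in scores])

# Hypothetical activation map: five channels with two neurons each.
activation_map = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
gated = gate_channels(activation_map, mask)
```

In this sketch the channel with the lowest score receives a mask value of zero, so all of its neurons are excluded from the next layer, while the remaining four channels pass through unchanged.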
FIG. ID11_C illustrates an example third branch implementation controller ID11_C101 to implement a branch path ID11_C102 that uses an adversarial balance between a layer width loss ID11_C104 of the first network layer and an accuracy loss ID11_C106 (e.g., accuracy constraint) of the first network layer. In some examples, the portion of the neural network ID11_A100 of FIG. ID11_A includes the third branch implementation controller ID11_C101 in place of the first branch implementation controller ID11_A101. In the illustrated example, the third branch implementation controller ID11_C101 implements a first FC layer ID11_C108 that is coupled to a ReLU ID11_C110 and that introduces nonlinearity (e.g., clipping activation by a threshold of one) into the output (e.g., probability vector having ID11_A14 output neurons) of the first FC layer ID11_C108. The third branch implementation controller ID11_C101 implements a second FC layer ID11_C112. The second FC layer ID11_C112 may be coupled to the ReLU ID11_C110, where the output (e.g., probability vector having 256 output neurons) of the second FC layer ID11_C112 may be processed by another instance of the ReLU ID11_C114. The ReLU ID11_C114, implemented by the example third branch implementation controller ID11_C101, may set some values in the output vector to zero. Accordingly, a multiplier ID11_C116, implemented by the example third branch implementation controller ID11_C101, may selectively exclude all neurons in the less important channels, with importance being enforced via a pruning ratio constraint.
During the training of the neural network, the layer width loss ID11_C104 may be provided to the ReLU ID11_C114, while the accuracy loss ID11_C106 (e.g., accuracy constraint) is provided to the multiplier ID11_C116. The layer width loss ID11_C104 may be determined based on the pruning ratio constraint. In one example, the layer width loss is determined by calculating the mean across all elements (e.g., output neurons) of the vector of multipliers and then computing the Euclidean norm (e.g., distance) between the mean and the pruning ratio constraint. Accordingly, the calculated loss may be considered to be a penalty for layer width. During the training of the neural network, the accuracy loss ID11_C106 may be balanced against the layer width loss ID11_C104. In one example, balancing determines the optimal tradeoff between channel reduction and accuracy. More particularly, during the training process, there may be an adversarial situation where compliance with the constraint imposed by the accuracy loss ID11_C106 minimizes the error of the network, but the layer width loss ID11_C104 minimizes the number of channels and results in a penalty if the number of channels does not comply with the pruning ratio constraint.
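The layer width loss calculation described above can be sketched as follows. The names `target_ratio` and `width_weight` are assumptions introduced for illustration: the source does not state whether the pruning ratio constraint is expressed as a fraction kept or a fraction pruned, nor how the two losses are weighted against each other. Because the mean is a scalar, the Euclidean norm reduces to an absolute difference.

```python
def layer_width_loss(multipliers, target_ratio):
    """Distance between the mean of the multiplier vector and the constraint.

    target_ratio is a hypothetical encoding of the pruning ratio constraint
    as the desired mean of the multipliers.
    """
    mean_multiplier = sum(multipliers) / len(multipliers)
    # Euclidean norm of a scalar difference is its absolute value.
    return abs(mean_multiplier - target_ratio)


def total_loss(accuracy_loss, multipliers, target_ratio, width_weight=1.0):
    """Balance the accuracy loss against the layer width penalty.

    width_weight is a hypothetical balancing factor, not taken from the source.
    """
    return accuracy_loss + width_weight * layer_width_loss(multipliers, target_ratio)


# Three of four channels active, target mean of 0.5: the penalty is 0.25.
penalty = layer_width_loss([1.0, 0.0, 1.0, 1.0], 0.5)
```

A larger `width_weight` pushes training toward compliance with the pruning ratio constraint, while a smaller value favors the accuracy constraint.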
While an example manner of implementing the portion of the neural network ID11_A100 of FIG. ID11_A is illustrated in FIGS. ID11_B and ID11_C, one or more of the elements, processes and/or devices illustrated in FIGS. ID11_B and ID11_C may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example first branch implementation controller ID11_A101, the example second branch implementation controller ID11_B101, the example third branch implementation controller ID11_C101, and/or, more generally, the example portion of the neural network ID11_A100 of FIG. ID11_A may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Example hardware implementations include implementation on the example compute circuitry D102 (e.g., the example processor D104) of
Flowcharts representative of example hardware logic, machine-readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the portion of the neural network ID11_A100 of FIG. ID11_A are shown in FIGS. ID11_D and ID11_E. The machine-readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor D152 shown in the example processor platform D150 discussed above in connection with
As mentioned above, the example processes of FIGS. ID11_D and ID11_E may be implemented using executable instructions (e.g., computer and/or machine-readable instructions) stored on a non-transitory computer and/or machine-readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
FIG. ID11_D shows a method ID11_D100 of conducting pruning operations. The method ID11_D100 may generally be implemented in a neural network including any one of the branch implementation controllers such as, for example, the first branch implementation controller ID11_A101 executing the branch path ID11_A108 (FIG. ID11_A), the second branch implementation controller ID11_B101 executing the branch path ID11_B102 (FIG. ID11_B) and/or the third branch implementation controller ID11_C101 executing the branch path ID11_C102 (FIG. ID11_C). More particularly, the method ID11_D100 may be implemented as one or more modules in a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in configurable logic such as, for example, programmable logic arrays (PLAs), field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), in fixed-functionality hardware logic using circuit technology such as, for example, application specific integrated circuit (ASIC), complementary metal oxide semiconductor (CMOS) or transistor-transistor logic (TTL) technology, or any combination thereof.
For example, computer program code to carry out operations shown in the method ID11_D100 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
At processing block ID11_D102, the first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 provides for training a neural network having a second network layer coupled to an output of a first network layer. In an adversarial balancing architecture, at block ID11_D102, the third branch implementation controller ID11_C101 may determine a layer width loss of the first network layer based on a pruning ratio constraint and balance, during the training of the neural network, an accuracy constraint of the first network layer against the layer width loss. At block ID11_D102, the first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 may prune the portion of the neural network ID11_A100 during the training of the neural network. The training stage pruning may include static techniques such as, for example, randomly removing neurons or groups of neurons from the network. The static techniques may also involve considering an absolute magnitude of weights and activations (e.g., importance of neurons) and removing the least of them in each network layer. In yet another example, the static techniques may consider an error of the network during the training time and attempt to learn parameters that represent the probability that a particular neuron or group of neurons may be dropped. The result of the training may be a final network configuration that may be pruned again dynamically after deployment as described herein.
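One of the static, training-stage techniques mentioned above (removing the neurons with the least absolute weight magnitude in a layer) might be sketched as follows. The function name, the use of a summed absolute weight as the importance measure, and the sample weights are assumptions made for illustration only.

```python
def prune_by_magnitude(layer_weights, n_remove):
    """Return the indices of the neurons kept after magnitude-based pruning.

    layer_weights holds one list of weights per neuron. Importance is
    approximated here by the sum of absolute weights (an assumption).
    """
    magnitude = [sum(abs(w) for w in neuron) for neuron in layer_weights]
    order = sorted(range(len(magnitude)), key=lambda i: magnitude[i])
    removed = set(order[:n_remove])
    return [i for i in range(len(layer_weights)) if i not in removed]


# Hypothetical layer with four neurons of two weights each.
weights = [[0.5, -0.5], [0.01, 0.02], [-2.0, 1.0], [0.1, -0.1]]
kept = prune_by_magnitude(weights, n_remove=2)
```

Unlike the post-deployment pruning performed by the branch paths, the set of kept neurons in this sketch is fixed once and does not depend on the input encountered at inference time.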
The example first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 aggregates the context information at block ID11_D104 from a first network layer in the neural network, where the context information is aggregated in real-time and after a training of the neural network. Thus, the context information may correspond to post deployment input data (e.g., inference images). In one example, block ID11_D104 includes averaging, by a downsample (e.g., pooling) layer in a branch path located between the first network layer and the second network layer, channel values in the first network layer. The example first branch implementation controller ID11_A101, the example second branch implementation controller ID11_B101 and/or the example third branch implementation controller ID11_C101 may utilize other approaches to aggregate the context information. At block ID11_D106, the first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 may conduct an importance classification of the context information, where one or more channels in the first network layer may be excluded from consideration by the second network layer at block ID11_D108 based on the importance classification. At block ID11_D108, the example first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 selects the one or more channels based on a pruning ratio constraint (e.g., percentage of channels to be pruned).
FIG. ID11_E shows a method ID11_E100 of conducting importance classifications of aggregated context information. The method ID11_E100 may readily be substituted for block ID11_D106 (FIG. ID11_D), already discussed. More particularly, the method ID11_E100 may be implemented by the example first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 as one or more modules in a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in configurable logic such as, for example, PLAs, FPGAs, CPLDs, in fixed-functionality hardware logic using circuit technology such as, for example, ASIC, CMOS or TTL technology, or any combination thereof.
At processing block ID11_E102, the example first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 generates, by a first FC layer in a branch path located between the first network layer and the second network layer, a high-level feature vector associated with the first network layer based on the aggregated context information. Additionally, at block ID11_E104, the example first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 may generate, by a second FC layer in the branch path, an importance score vector based on the high-level feature vector, where the importance score vector contains zero values for less important channels. In such a case, at block ID11_D108 (FIG. ID11_D), the example first branch implementation controller ID11_A101, the second branch implementation controller ID11_B101, and/or the third branch implementation controller ID11_C101 may multiply the output of the first network layer by the importance score vector.
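The flow of blocks ID11_D104, ID11_E102, ID11_E104 and ID11_D108 can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the helper names, the placement of the ReLU operations, and the tiny hand-picked weights are hypothetical and chosen only so that the result is easy to follow. A deployed network would use trained FC weights.

```python
def average_pool(feature_map):
    """Aggregate context: average the values of each channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def fully_connected(x, weights, bias):
    """One FC layer: a weighted sum per output neuron plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Clamp negative values to zero."""
    return [max(0.0, v) for v in x]


def branch_path(feature_map, w1, b1, w2, b2):
    context = average_pool(feature_map)                       # block ID11_D104
    features = relu(fully_connected(context, w1, b1))         # block ID11_E102
    importance = relu(fully_connected(features, w2, b2))      # block ID11_E104
    pruned = [[s * v for v in channel]                        # block ID11_D108
              for channel, s in zip(feature_map, importance)]
    return importance, pruned


# Hypothetical output of the first network layer: three channels, two values each.
feature_map = [[1.0, 3.0], [0.5, 0.5], [2.0, 2.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
importance, pruned = branch_path(
    feature_map, identity, [0.0, 0.0, 0.0], identity, [-1.0, -1.0, -1.0]
)
```

With these illustrative weights the weakly activated middle channel receives an importance score of zero, so multiplying the layer output by the importance score vector excludes that channel from consideration by the second network layer.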
Turning now to FIG. ID11_F, a computer vision system ID11_F100 (e.g., computing system) is shown. The system ID11_F100 may generally be part of an electronic device/platform having computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communications functionality (e.g., smart phone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), vehicular functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), etc., or any combination thereof. In the illustrated example, the system ID11_F100 includes one or more processors ID11_F102 (e.g., host processor(s), central processing unit(s)/CPU(s), vision processing units/VPU(s)) having one or more cores ID11_F104 and an integrated memory controller (IMC) ID11_F106 that is coupled to a system memory ID11_F108.
The illustrated system ID11_F100 also includes an input output (IO) module ID11_F110 implemented together with the processor(s) ID11_F102 on a semiconductor die ID11_F112 as a system on chip (SoC), where the IO module ID11_F110 functions as a host device and may communicate with, for example, a display ID11_F114 (e.g., touch screen, liquid crystal display/LCD, light emitting diode/LED display), a network controller ID11_F116 (e.g., wired and/or wireless), one or more cameras ID11_F115, and mass storage ID11_F118 (e.g., hard disk drive/HDD, optical disk, solid state drive/SSD, flash memory). The processor(s) ID11_F102 may execute instructions ID11_F120 (e.g., a specialized kernel inside a Math Kernel Library for Deep Learning Networks/MKL-DNN) retrieved from the system memory ID11_F108 and/or the mass storage ID11_F118 to perform one or more aspects of the method ID11_D100 (FIG. ID11_D) and/or the method ID11_E100 (FIG. ID11_E), already discussed.
Thus, execution of the instructions ID11_F120 may cause the system ID11_F100 to aggregate context information from a first network layer in a neural network having a second network layer coupled to an output of the first network layer, where the context information is aggregated in real-time and after a training of the neural network. The context information may be associated with image data (e.g., still images, video frames) captured by the camera(s) ID11_F115. Additionally, execution of the instructions ID11_F120 may cause the system ID11_F100 to conduct an importance classification of the aggregated context information and selectively exclude one or more channels in the first network layer from consideration by the second network layer based on the importance classification.
FIG. ID11_G shows a semiconductor apparatus ID11_G100 (e.g., chip, die, package). The illustrated apparatus ID11_G100 includes one or more substrates ID11_G102 (e.g., silicon, sapphire, gallium arsenide) and logic ID11_G104 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) ID11_G102. The logic ID11_G104 may implement one or more aspects of the method ID11_D100 (FIG. ID11_D) and/or the method ID11_E100 (FIG. ID11_E), already discussed. Thus, the logic ID11_G104 may aggregate context information from a first network layer in a neural network having a second network layer coupled to an output of the first network layer, where the context information is aggregated in real-time and after a training of the neural network. The logic ID11_G104 may also conduct an importance classification of the aggregated context information and selectively exclude one or more channels from consideration by the second network layer based on the importance classification. The logic ID11_G104 may be implemented at least partly in configurable logic or fixed-functionality hardware logic. The logic ID11_G104 may also include the neural network. In one example, the logic ID11_G104 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) ID11_G102. Thus, the interface between the logic ID11_G104 and the substrate(s) ID11_G102 may not be an abrupt junction. The logic ID11_G104 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) ID11_G102.
FIG. ID11_H illustrates a processor core ID11_H100 according to one embodiment. The processor core ID11_H100 may be the core for any type of processor, such as a micro-processor, an embedded processor, a digital signal processor (DSP), a network processor, or other device to execute code. Although only one processor core ID11_H100 is illustrated in FIG. ID11_H, a processing element may alternatively include more than one of the processor core ID11_H100 illustrated in FIG. ID11_H. The processor core ID11_H100 may be a single-threaded core or, for at least one embodiment, the processor core ID11_H100 may be multithreaded in that it may include more than one hardware thread context (or “logical processor”) per core.
FIG. ID11_H also illustrates a memory ID11_H170 coupled to the processor core ID11_H100. The memory ID11_H170 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art. The memory ID11_H170 may include one or more code ID11_H113 instruction(s) to be executed by the processor core ID11_H100, where the code ID11_H113 may implement the method ID11_D100 (FIG. ID11_D) and/or the method ID11_E100 (FIG. ID11_E), already discussed. The processor core ID11_H100 follows a program sequence of instructions indicated by the code ID11_H113. Each instruction may enter a front end portion ID11_H110 and be processed by one or more decoders ID11_H120. The decoder ID11_H120 may generate as its output a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals which reflect the original code instruction. The illustrated front end portion ID11_H110 also includes register renaming logic ID11_H125 and scheduling logic ID11_H130, which generally allocate resources and queue the operation corresponding to the convert instruction for execution.
The processor core ID11_H100 is shown including execution logic ID11_H150 having a set of execution units ID11_H155a through ID11_H155n. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic ID11_H150 performs the operations specified by code instructions.
After completion of execution of the operations specified by the code instructions, back end logic ID11_H160 retires the instructions of the code ID11_H113. In one embodiment, the processor core ID11_H100 allows out of order execution but requires in order retirement of instructions. Retirement logic ID11_H165 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core ID11_H100 is transformed during execution of the code ID11_H113, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic ID11_H125, and any registers (not shown) modified by the execution logic ID11_H150.
Although not illustrated in FIG. ID11_H, a processing element may include other elements on chip with the processor core ID11_H100. For example, a processing element may include memory control logic along with the processor core ID11_H100. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches.
Further variations of examples disclosed herein are provided by the following examples.
Example ID11_A1 is a system including a processor, and a memory coupled to the processor, the memory including executable computer program instructions, which when executed by the processor, cause the system to train a neural network comprising at least a first network layer and a second network layer, wherein the second network layer is coupled to an output of the first network layer, and wherein the first network layer has a plurality of channels, aggregate context information from the first network layer of the neural network, wherein the context information is to be aggregated in real-time and after a training of the neural network, and wherein the context information is to include channel values, generate a feature vector associated with the first network layer of the neural network based on the aggregated context information, generate an importance score vector based on the generated feature vector, wherein the importance score vector includes information indicating importance of corresponding channels of the first network layer, and selectively exclude one or more channels in the first network layer from consideration by the second network layer based on the importance score vector.
In Example ID11_A2, the subject matter of Example ID11_A1 optionally includes the instructions, when executed, cause the system to average the channel values to aggregate the context information from the first network layer of the neural network.
In Example ID11_A3, the subject matter of any one or more of Examples ID11_A1 through ID11_A2 optionally includes the importance score vector having zero values for neurons in less important channels of the first network layer.
In Example ID11_A4, the subject matter of Example ID11_A3 optionally includes the instructions, when executed, further cause the system to multiply the output of the first network layer by the importance score vector.
In Example ID11_A5, the subject matter of any one or more of Examples ID11_A1 through ID11_A4 optionally includes the instructions, when executed, cause the system to select the one or more channels of the first network layer based on a pruning ratio constraint, wherein the pruning ratio constraint is a percentage of channels to be pruned.
In Example ID11_A6, the subject matter of any one or more of Examples ID11_A1 through ID11_A5 optionally includes the instructions, when executed, cause the system to determine a layer width loss of the first network layer based on a pruning ratio constraint, wherein the pruning ratio constraint is a percentage of channels to be pruned, and balance, during the training of the neural network, an accuracy constraint of the first network layer against the layer width loss.
Convolutional neural networks (CNNs) are a class of deep neural networks (DNNs) that are typically employed to analyze visual images, as well as other types of patterned data. In some examples, CNNs can be trained to learn features and/or classify data. For example, a CNN can be trained to learn weights in filters (or kernels). As used herein, a kernel is a matrix of weights. In operation, one or more kernels may be multiplied with an input to extract feature information. As used herein, a filter is one or more kernels, and a convolution layer includes two or more filters. The trained model can be then used to identify or extract information such as edges, segments, etc., in an input image. Each convolutional kernel may be defined by a width and height (hyper parameters). Additionally, convolution layers typically convolve the inputs (e.g., input image, kernels, etc.) and pass the output to a next layer of the CNN. Example weights disclosed herein include one or more values represented by alphanumeric characters. Such values may be stored in one or more data structures (e.g., data structure(s)), in which example data structures include integers, floating point representations and/or characters. Weights and corresponding values to represent such weights represent data stored in any manner. Such data may also propagate from a first data structure to a second data structure or any number of subsequent data structures along a data path, such as a bus.
Currently, the computational power needed to train a CNN depends on the number of kernels trained as well as the sizes of the kernels (e.g., m×n) and the input images. Current (traditional) neural network approaches result in large neural network models having hundreds of kernels in many layers, which require memory resources. Such requirements strain the ability of devices at the Edge (Edge devices) to operate efficiently when such Edge devices are typically bound by limited memory and/or processing capabilities. Additionally, efforts to increase a degree of accuracy of neural networks typically involve implementing additional layers. However, adding layer depth will also make such models more difficult to train.
Some examples disclosed herein include methods, systems, articles of manufacture and apparatus that reduce the number of kernels and/or the number of layers used to train a CNN. In some examples, a dynamic adaptive kernel is generated for each region of an input image by convolving each input region with multiple kernels to build a dynamic kernel specific for that input region. The generated kernel is then convolved with that same input region to generate a single pixel output. Thus, instead of multiple outputs associated with multiple kernels, a single output can be provided for all the multiple kernels. In this way, the total number of kernels trained in the CNN layer (and thus the overall computational power needed to train the CNN) can be reduced.
FIG. ID14_1 is a conceptual illustration ID14_100 of an example convolution operation using a static kernel. As shown in FIG. ID14_1, an input image ID14_110 includes 9×9 image pixels represented by a grid. In the illustrated example of FIG. ID14_1, each cell in the grid of input image ID14_110 represents an image pixel value. Further, in the illustrated example of FIG. ID14_1, an example kernel ID14_120 is represented as a 3×3 grid of weights. In some examples, the kernel ID14_120 is referred to as a mask, which is convolved with the input image ID14_110. The kernel ID14_120 is then applied (e.g., convolved) to portions of the input image ID14_110 to generate an example output image ID14_130. For example, during a first iteration (e.g., a first location of a sliding window positioned and/or otherwise applied through the input image) the mask ID14_120 is overlaid with the input image ID14_110 and each value in that particular portion (e.g., window) of the input image ID14_110 is multiplied with the corresponding value in the mask ID14_120. Subsequent iterations (e.g., slide positions of the window) continue mask multiplication as the mask ID14_120 is moved to adjacent positions of the input image ID14_110. In the illustrated example of FIG. ID14_1, the output image ID14_130 is the result of each pixel of the input image ID14_110 being replaced with a weighted sum of itself and nearby pixels. The weights of the weighted sum are defined by the example kernel ID14_120.
In some examples, the same weights in kernel ID14_120 are applied across every 3×3 section of the example input image ID14_110 (e.g., as a sliding window, etc.) to compute the individual pixel values of the example output image ID14_130. Further, an example CNN training process could apply multiple kernels similar to the example kernel ID14_120 (but with different weight values) to obtain multiple output images similar to the output image ID14_130 in each layer of the CNN.
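The static kernel operation of FIG. ID14_1 can be sketched as a sliding-window weighted sum. The sketch below is an illustration under assumptions not stated in the source: the window is only placed where it fits entirely inside the image (so a 9×9 input yields a 7×7 output), the kernel is applied without flipping, as is common in CNN implementations, and the pixel and weight values are made up.

```python
def convolve_static(image, kernel):
    """Apply the same kernel at every window position of the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            # Weighted sum of the window with the kernel weights.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 9x9 input image and a 3x3 kernel of ones.
image = [[float(r + c) for c in range(9)] for r in range(9)]
ones_kernel = [[1.0] * 3 for _ in range(3)]
output = convolve_static(image, ones_kernel)
```

Applying several kernels in a layer amounts to calling `convolve_static` once per kernel, which is the repeated windowed pass over the whole image that drives the computational demand.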
It is noted that the sizes of the input image ID14_110 and the kernel ID14_120, the weight values in the kernel ID14_120, and the pixel values in the image ID14_110 could vary and are only illustrated as shown for the sake of example. Other sizes, weight values, and pixel values are possible.
As noted above, the CNN training process described in connection with FIG. ID14_1 may be associated with a relatively high computing power requirement depending on the number of kernels used, the number of CNN layers, window sizes, and/or the input image size. For each windowed operation, a single pixel output ID14_140 is calculated for the output image ID14_130. In the event two or more masks are to be applied to the input image ID14_110, then the windowed convolution operation must repeat across the entire input image ID14_110, thereby causing a substantial computational demand.
FIG. ID14_2A is a conceptual illustration of an example convolution operation using a dynamic kernel ID14_200. As shown, an example input image ID14_210 is represented similarly to input image ID14_110 as a grid of pixel values. In the example convolution operation of FIG. ID14_2A however, each section (e.g., section ID14_212, sometimes referred to as a window, a sliding window, or a kernel window) of the example input image ID14_210 is convolved with an example dynamic kernel ID14_220 that includes weight values adjusted according to the data in the image section ID14_212.
Specifically, multiple (static) kernels ID14_222, ID14_224, ID14_226, ID14_228, etc., are first individually convolved with image section ID14_212 (e.g., by computing a weighted sum of the center pixel and its adjacent pixels), each to obtain a single output value. The illustrated example of FIG. ID14_2A includes nine (9) different static kernels (filters), but examples disclosed herein are not limited thereto. The individual output values from the convolution computations performed using the different static kernels ID14_222, ID14_224, ID14_226, ID14_228, etc., are then combined (as weight values) into a generated dynamic kernel ID14_220 having a same size as each of the individual kernels ID14_222, ID14_224, ID14_226, ID14_228, etc. Unlike traditional approaches to convolve a portion of an input pattern with multiple kernels, each of which produces a single pixel output, examples disclosed herein use each such output as a weight for the generated dynamic kernel ID14_220.
Next, the generated dynamic kernel ID14_220 is convolved with the same image section ID14_212 to generate a single output pixel ID14_230. Thus, instead of generating an output pixel for each one of kernels ID14_222, ID14_224, ID14_226, ID14_228, etc., a single pixel output is obtained using a single dynamic kernel.
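The two-step dynamic kernel operation of FIG. ID14_2A can be sketched as follows. The function names, the row-major placement of the nine static kernel outputs into the 3×3 dynamic kernel, and the one-hot static kernels are assumptions made for illustration. In practice the static kernels would hold trained weights.

```python
def weighted_sum(section, kernel):
    """Element-wise product of a window and a kernel, summed to one value."""
    return sum(s * k
               for section_row, kernel_row in zip(section, kernel)
               for s, k in zip(section_row, kernel_row))


def dynamic_kernel(section, static_kernels):
    """Build a kernel whose weights are the static kernel outputs.

    Assumption: outputs are placed in row-major order, so a 3x3 section
    needs nine static kernels.
    """
    size = len(section)
    values = [weighted_sum(section, k) for k in static_kernels]
    return [values[r * size:(r + 1) * size] for r in range(size)]


def dynamic_output_pixel(section, static_kernels):
    """Convolve the section with its own generated dynamic kernel."""
    return weighted_sum(section, dynamic_kernel(section, static_kernels))


def one_hot_kernel(position, size=3):
    """Hypothetical static kernel with a single weight of one."""
    return [[1 if r * size + c == position else 0 for c in range(size)]
            for r in range(size)]


section = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
static_kernels = [one_hot_kernel(p) for p in range(9)]
generated = dynamic_kernel(section, static_kernels)
pixel = dynamic_output_pixel(section, static_kernels)
```

With the one-hot kernels chosen here the generated dynamic kernel simply reproduces the image section, which makes it easy to see that the kernel weights follow the data in the window and that the nine static kernels yield one output pixel rather than nine.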
In line with the discussion above, the weights in dynamic kernel ID14_220 may vary depending on the data content of a respective image section ID14_212 convolved with the dynamic kernel ID14_220. This is unlike traditional approaches to convolution, in which the same variety of kernels (e.g., ID14_222, ID14_224, etc.) are applied for each window of the input pattern. That is, a different section of image ID14_210 (e.g., after a windowed portion moves to the right) may have a different pattern (i.e., different pixel values) and thus result in different weight values in its correspondingly generated dynamic kernel ID14_220. As such, each convolution performed during each windowed portion of the input image results in a unique and/or otherwise dynamic kernel ID14_220 that is multiplied with each section of the input image ID14_212.
At least one benefit of examples disclosed herein results in better filters (sometimes referred to as “descriptors”) generated during image convolution. Briefly turning to FIG. ID14_2B, the example input image ID14_210 is shown with eight (8) example filters surrounding it. The illustrated example of FIG. ID14_2B includes a first dynamic filter ID14_250, a second dynamic filter ID14_252, a third dynamic filter ID14_254, a fourth dynamic filter ID14_256, a fifth dynamic filter ID14_258, a sixth dynamic filter ID14_260, a seventh dynamic filter ID14_262 and an eighth dynamic filter ID14_264. Unlike traditional convolution techniques that apply the same filters in a manner independent of specific portions of the input image ID14_210, example dynamic filters disclosed herein are uniquely generated based on each portion of the input image ID14_210. In other words, the dynamic filters/kernels disclosed herein change their behavior based on the input image so that every portion of the input image is convolved with a different filter. As a result, a reduction of terms results because fewer kernels and layers are needed.
Generation of dynamic filters disclosed herein permits a more efficient extraction of information from input images. For example, the first dynamic filter ID14_250 corresponds to a portion of the input image ID14_210 that is homogeneous (e.g., includes only a single color of pixels within the window of interest). The example first dynamic filter ID14_250 permits the convolution process to learn to ignore such portions of the input image that contain no information, thereby allowing such regions to be skipped and/or otherwise preventing further computational resources from being applied to the analysis of such portions.
Additionally, through this example process, the number of kernels (and thus hyper parameters) used in each layer of a CNN training model implemented according to the example illustrated in FIG. ID14_2A can be less than a corresponding number of kernels (and thus hyper parameters) used in each layer of a traditional CNN training model implemented according to the example described in connection with FIG. ID14_1. Accordingly, examples disclosed herein reduce memory requirements for CNN models and reduce computational resources needed for such CNN models. While both the traditional CNN technique and dynamic kernel techniques disclosed herein converge to a solution (e.g., a solution having a particular percent accuracy metric), examples disclosed herein converge using fewer weights and fewer epochs. Furthermore, because dynamic filters are applied for each portion of the input image, accuracy of extracted information is improved.
Generally speaking, as the accuracy of descriptors increases, the corresponding need for a particular quantity of kernels and/or neurons in subsequent layers decreases. Table ID14_A illustrates a typical LeNet CNN topology, and Table ID14_B illustrates an example adaptive kernel topology disclosed herein.
As illustrated in example Table ID14_A, a typical LeNet CNN topology includes a first layer with twenty kernels and a second layer with fifty kernels. However, examples disclosed herein that employ adaptive kernels (e.g., Table ID14_B) enable a network topology in which a first layer includes five kernels and a second layer includes ten kernels. Furthermore, the example LeNet CNN topology requires 431,000 parameters to achieve greater than 99% accuracy, whereas example adaptive kernels disclosed herein require approximately 6500 parameters to achieve the same accuracy, thereby enabling substantially reduced memory requirements. Additionally, the commercially available LeNet5 model requires 60,000 parameters to achieve an accuracy greater than 99%, which is approximately nine times greater than the number of parameters required by examples disclosed herein to achieve a substantially similar (e.g., within 1-2%) accuracy. While the example “Best Practices” model by Patrice Y. Simard (2003) also achieves greater than 99% accuracy using four layers, that commercially available model requires over 132,000 parameters, which is approximately twenty times greater than the quantity of parameters required by examples disclosed herein.
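The parameter-count comparison above can be checked with simple arithmetic. The following minimal sketch uses only the figures quoted in the preceding paragraph; the dictionary keys and variable names are illustrative, and the adaptive-kernel count is the approximate value of 6500 stated above, so the resulting ratios are approximate as well.

```python
# Parameter counts quoted in the text; 6500 is stated as approximate.
ADAPTIVE_PARAMS = 6_500
baseline_params = {
    "LeNet (Table ID14_A)": 431_000,
    "LeNet5": 60_000,
    "Simard 2003": 132_000,
}

# Ratio of each baseline's parameter count to the adaptive-kernel count.
ratios = {name: count / ADAPTIVE_PARAMS for name, count in baseline_params.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the adaptive-kernel parameter count")
```

Under these figures the ratios are roughly 66, 9, and 20, which agrees with the "approximately nine times" and "twenty times" statements above.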
FIG. ID14_3 is a block diagram of example machine learning trainer circuitry ID14_300 implemented in accordance with teachings of this disclosure for training an adaptive convolutional neural network. As shown, the example machine learning trainer circuitry ID14_300 includes an example training data datastore ID14_305, example dynamic kernel convolver circuitry ID14_310, an example model data datastore ID14_315, and example model provider circuitry ID14_320. The example machine learning trainer circuitry ID14_300 of FIG. ID14_3 is responsible for facilitating the examples disclosed in FIG. ID14_2A.
In some examples, the machine learning trainer circuitry ID14_300 includes means for machine learning training, means for dynamic kernel convolving and means for model providing. For example, the means for machine learning training may be implemented by the machine learning trainer circuitry ID14_300, the means for dynamic kernel convolving may be implemented by the dynamic kernel convolver circuitry ID14_310, and the means for model providing may be implemented by the model provider circuitry ID14_320. In some examples, the aforementioned structure may be implemented by machine executable instructions disclosed herein and executed by processor circuitry, which may be implemented by the example processor circuitry D212 of
The example training data datastore ID14_305 of the illustrated example of FIG. ID14_3 is implemented by any storage device (e.g., memory, structure, and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, solid state memory, hard drive(s), thumb drive(s), etc). Furthermore, the data stored in the example training data datastore ID14_305 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc. In one example, the data stored in the example training data datastore ID14_305 may include image data (e.g., image files) that include training images representative of various patterns (e.g., edges, segments, etc.), similar to input image ID14_210, for example.
While in the illustrated example of FIG. ID14_3 the training data datastore ID14_305 is illustrated as a single device, the example training data datastore ID14_305 and/or any other data storage devices described herein may be implemented by any number and/or type(s) of memories. The example training data datastore ID14_305 stores data that is used by the example dynamic kernel convolver circuitry ID14_310 to train a model.
The example dynamic kernel convolver circuitry ID14_310 of the illustrated example of FIG. ID14_3 is implemented using a logic circuit such as, for example, a hardware processor. For example, the dynamic kernel convolver circuitry ID14_310 can be implemented using computer readable instructions that are executed by a processor to perform the functions of the dynamic kernel convolver circuitry ID14_310. Other types of circuitry may be additionally or alternatively used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), Coarse Grained Reconfigurable Architecture(s) (CGRA(s)), image signal processor(s) (ISP(s)), etc.
In operation, the example dynamic kernel convolver circuitry ID14_310 generates a dynamic kernel for each section of training data model (e.g., image) by convolving the data section with multiple different kernels and combining the outputs of the convolved kernels to generate a single dynamic kernel for convolving that data section, in a manner consistent with the discussion corresponding to FIG. ID14_2A. Further, the example dynamic kernel convolver circuitry ID14_310 stores the results of the dynamic kernel convolution operation described above in the example model data datastore ID14_315.
The example model data datastore ID14_315 of the illustrated example of FIG. ID14_3 is implemented by any storage device (e.g., memory, structure, and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, solid state memory, hard drive(s), thumb drive(s)). Furthermore, the data stored in the example model data datastore ID14_315 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc. In one specific example, the data stored in the example model datastore ID14_315 includes kernels (e.g., weight filters) trained to identify and/or classify features in image data (e.g., segments, background).
While in the illustrated example of FIG. ID14_3 the model data datastore ID14_315 is illustrated as a single device, the example model data datastore ID14_315 and/or any other data storage devices described herein may be implemented by any number and/or type(s) of memories. The example model data datastore ID14_315 stores information concerning a model trained by the example dynamic kernel convolver circuitry ID14_310. Such information may include, for example, model hyperparameters, information concerning the architecture of the model, etc.
The example model provider circuitry ID14_320 of the illustrated example enables the example machine learning trainer ID14_300 to transmit and/or otherwise provide access to a model stored in the model data datastore ID14_315. In this manner, the model may be trained at the machine learning trainer ID14_300 (e.g., a first device), and be provided to another device (e.g., a second device) by the model provider circuitry ID14_320 via, for example, a network (e.g., the Internet) to allow the other device to utilize the model for inference.
While an example manner of implementing the machine learning trainer ID14_300 of
A flowchart representative of example hardware logic, machine-readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the machine learning trainer ID14_300 of FIG. ID14_3 is shown in FIG. ID14_4. The machine-readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor D152 shown in the example processor platform D150 discussed above in connection with
The machine-readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine-readable instructions as described herein may be stored as data or a data structure (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine-readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine-readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine-readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement one or more functions that may together form a program such as that described herein.
In another example, the machine-readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device. In another example, the machine-readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine-readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine-readable media, as used herein, may include machine-readable instructions and/or program(s) regardless of the particular format or state of the machine-readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine-readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine-readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example processes of FIG. ID14_4 may be implemented using executable instructions (e.g., computer and/or machine-readable instructions) stored on a non-transitory computer and/or machine-readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
FIG. ID14_4 is a flowchart representative of example machine-readable instructions ID14_400 which may be executed to implement the example machine learning trainer ID14_300 to train a CNN using dynamic kernels.
At block ID14_402, the example machine learning trainer ID14_300 obtains data to be analyzed. For example, the data to be analyzed may include an image, retrieved from an Edge device, whose content has not yet been identified, such as an image of a pedestrian, a car or a traffic sign. In some examples, the data is obtained from a memory, a storage or one or more devices having sensors (e.g., cameras, microphones, gyroscopes, etc.). In some examples, the data to be analyzed is vibrational information from a smartwatch, in which a gesture needs to be identified. In some examples, the data to be analyzed is audio information within which one or more commands are to be identified. The example dynamic kernel convolver circuitry ID14_310 obtains any number of kernels (block ID14_404), such as static kernels typically employed in neural networks. In some examples, the static kernels are initially populated with random values and, in traditional neural network implementations, such static kernels are modified during a learning process. The plurality of kernels may be similar to kernels ID14_120, ID14_222, ID14_224, ID14_226, and/or ID14_228. In one example, the dynamic kernel convolver circuitry ID14_310 obtains the kernels from the example model data datastore ID14_315. For example, initial weight values in the kernels may be computed (at least partially) using another neural network training model based on the same training data datastore and stored in the model data datastore ID14_315. In another example, the initial weight values in the kernels could be randomly generated and then adjusted during the training process of the neural network. Other examples are possible.
As described above, while examples disclosed herein begin with any number of static kernels, such static kernels are not employed in the traditional manner by convolution with each selected window of the input image.
The example convolution window shift circuitry ID14_312 positions a convolution window on an unexamined portion of an input image (block ID14_406). Returning briefly to the illustrated example of FIG. ID14_2A, the section ID14_212 represents a convolution window that is stepped through the input image ID14_210. The example dynamic kernel convolver circuitry ID14_310 multiplies (convolves) window image data corresponding to contents within the convolution window ID14_212 with a selected one of the static filters to generate a weight value for the example dynamic kernel (block ID14_408). Each generated/calculated output (e.g., pixel) may correspond to a weighted sum of the pixel values in the input image section ID14_212, where the weights are defined by the selected kernel. In the event the example dynamic kernel convolver circuitry ID14_310 determines that one or more additional static kernels have not yet been applied to the selected portion of the input image (block ID14_410), then the example dynamic kernel convolver circuitry ID14_310 selects a next available static kernel (block ID14_412) and control returns to block ID14_408.
On the other hand, in the event the dynamic kernel convolver circuitry ID14_310 determines that all static kernels have been applied to the selected portion of the input image (block ID14_410), the example model data datastore ID14_315 stores the dynamic kernel corresponding to the selected portion of the input image (block ID14_414). In some examples, the dynamic kernel convolver circuitry ID14_310 combines the output pixels to generate a dynamic kernel for the first section of the image. The example dynamic kernel convolver circuitry ID14_310 multiplies (convolves) the window image data from the input image with the dynamic kernel to generate a pixel output (block ID14_416). The example convolution window shift circuitry ID14_312 determines whether there are additional window positions within the input image that have not yet been convolved (block ID14_418). If so, control returns to block ID14_406, otherwise the example model provider circuitry ID14_320 saves the model to storage (block ID14_420).
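The flow of blocks ID14_406 through ID14_418 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: it assumes square windows, a stride of one, and exactly one static kernel per dynamic-kernel element, so that each static kernel contributes one weight and the weights are arranged in row-major order. The names `window_dot` and `dynamic_convolve` are hypothetical.

```python
def window_dot(window, kernel):
    """Weighted sum of a window's pixels, with weights taken from the kernel."""
    return sum(p * k for p_row, k_row in zip(window, kernel)
               for p, k in zip(p_row, k_row))


def dynamic_convolve(image, static_kernels, size):
    """Convolve each size x size window with a kernel generated from that window.

    Assumption (not confirmed by the text): len(static_kernels) == size * size,
    i.e. one static kernel yields one weight of the dynamic kernel.
    """
    if len(static_kernels) != size * size:
        raise ValueError("need one static kernel per dynamic-kernel element")
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    output = []
    for r in range(rows):                      # step the window (block ID14_406)
        out_row = []
        for c in range(cols):
            window = [row[c:c + size] for row in image[r:r + size]]
            # One weight per static kernel (blocks ID14_408 to ID14_412).
            weights = [window_dot(window, k) for k in static_kernels]
            # Combine the weights into the dynamic kernel for this window.
            dynamic = [weights[i * size:(i + 1) * size] for i in range(size)]
            # Convolve the same window with its dynamic kernel (block ID14_416).
            out_row.append(window_dot(window, dynamic))
        output.append(out_row)
    return output


# Usage: four one-hot static kernels, each selecting one pixel of a 2x2 window,
# so the dynamic kernel equals the window and each output is a sum of squares.
selectors = [[[1, 0], [0, 0]], [[0, 1], [0, 0]],
             [[0, 0], [1, 0]], [[0, 0], [0, 1]]]
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(dynamic_convolve(image, selectors, 2))  # prints [[46, 74], [154, 206]]
```

Because the dynamic kernel is recomputed from each window's contents, a homogeneous all-zero region produces an all-zero dynamic kernel and a zero output, which mirrors the behavior described for uninformative image portions.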
From the foregoing, it will be appreciated that example methods, apparatus and articles of manufacture have been disclosed for adaptively training a convolutional neural network. The disclosed methods, apparatus and articles of manufacture improve the efficiency of using a computing device by reducing the number of kernels and/or layers used for training a convolutional neural network model. The disclosed methods, apparatus and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.
Further variation of examples disclosed above is provided by the following examples.
Example methods, apparatus, systems, and articles of manufacture to optimize resources in Edge networks are disclosed herein. Further examples and combinations thereof include the following:
Example 145 includes an apparatus comprising at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus, a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations, or Application Specific Integrated Circuitry including logic gate circuitry to perform one or more third operations, the processor circuitry to at least one of perform at least one of the first operations, the second operations or the third operations to generate a first weight by convolving the portion of the input image with a first one of static kernels, generate a second weight by convolving the portion of the input image with a second one of the static kernels, in response to determining all available static kernels have been convolved with the portion of the input image, generate a dynamic kernel by combining the first weight with the second weight, generate an output pixel corresponding to the portion of the input image by convolving the portion of the input image with the dynamic kernel, and build a convolution model from respective ones of additional output pixels corresponding to respective ones of additional portions of the input image.
Example 146 includes the apparatus as defined in example 145, wherein the processor circuitry is to build the convolution model having a first layer depth value and a first quantity of parameters.
Example 147 includes the apparatus as defined in example 146, wherein the first layer depth of the convolution model is the same as a commercially available model, the first quantity of parameters of the convolution model lower than a second quantity of parameters of the commercially available model by a factor of at least nine.
Example 148 includes the apparatus as defined in example 146, wherein the processor circuitry is to build the convolution model with a first accuracy metric substantially equal to a second accuracy metric corresponding to a commercially available model.
Example 149 includes the apparatus as defined in example 145, wherein the processor circuitry is to access the input image with interface circuitry.
Example 150 includes at least one non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least generate a first weight by convolving the portion of the input image with a first one of static kernels, generate a second weight by convolving the portion of the input image with a second one of the static kernels, in response to determining all available static kernels have been convolved with the portion of the input image, generate a dynamic kernel by combining the first weight with the second weight, generate an output pixel corresponding to the portion of the input image by convolving the portion of the input image with the dynamic kernel, and build a convolution model from respective ones of additional output pixels corresponding to respective ones of additional portions of the input image.
Example 151 includes the computer readable storage medium as defined in example 150, wherein the instructions, when executed, cause the at least one processor to build the convolution model having a first layer depth value and a first quantity of parameters.
Example 152 includes the computer readable storage medium as defined in example 151, wherein the first layer depth of the convolution model is the same as a commercially available model, the first quantity of parameters of the convolution model lower than a second quantity of parameters of the commercially available model by a factor of at least nine.
Example 153 includes the computer readable storage medium as defined in example 152, wherein the instructions, when executed, cause the at least one processor to build the convolution model with a first accuracy metric substantially equal to a second accuracy metric corresponding to a commercially available model.
Example 154 includes the computer readable storage medium as defined in example 151, wherein the instructions, when executed, cause the at least one processor to access the input image with interface circuitry.
Example 155 includes a method comprising generating a first weight by convolving the portion of the input image with a first one of static kernels, generating a second weight by convolving the portion of the input image with a second one of the static kernels, in response to determining all available static kernels have been convolved with the portion of the input image, generating a dynamic kernel by combining the first weight with the second weight, generating an output pixel corresponding to the portion of the input image by convolving the portion of the input image with the dynamic kernel, and building a convolution model from respective ones of additional output pixels corresponding to respective ones of additional portions of the input image.
Example 156 includes the method as defined in example 155, further including building the convolution model having a first layer depth value and a first quantity of parameters.
Example 157 includes the method as defined in example 156, wherein the first layer depth of the convolution model is the same as a commercially available model, the first quantity of parameters of the convolution model lower than a second quantity of parameters of the commercially available model by a factor of at least nine.
Example 158 includes the method as defined in example 156, further including building the convolution model with a first accuracy metric substantially equal to a second accuracy metric corresponding to a commercially available model.
Example 159 includes the method as defined in example 156, further including accessing the input image via interface circuitry.
Example 160 includes an apparatus to generate a model comprising convolution window shift circuitry to position a kernel window to a portion of an input image, dynamic kernel convolver circuitry to generate a first weight by convolving the portion of the input image with a first one of static kernels, generate a second weight by convolving the portion of the input image with a second one of the static kernels, in response to determining all available static kernels have been convolved with the portion of the input image, generate a dynamic kernel by combining the first weight with the second weight, and generate an output pixel corresponding to the portion of the input image by convolving the portion of the input image with the dynamic kernel, and model provider circuitry to build a convolution model from respective ones of additional output pixels corresponding to respective ones of additional portions of the input image.
Example 161 includes the apparatus as defined in example 160, wherein the model provider circuitry is to build the convolution model having a first layer depth value and a first quantity of parameters.
Example 162 includes the apparatus as defined in example 161, wherein the first layer depth of the convolution model is the same as a commercially available model, the first quantity of parameters of the convolution model lower than a second quantity of parameters of the commercially available model by a factor of at least nine.
Example 163 includes the apparatus as defined in example 161, wherein the model provider circuitry is to build the convolution model with a first accuracy metric substantially equal to a second accuracy metric corresponding to a commercially available model.
Example 164 includes the apparatus as defined in example 160, further including interface circuitry to access the input image.
Artificial intelligence (AI), including machine learning (ML), deep learning (DL), and/or other artificial machine-driven logic, enables machines (e.g., computers, logic circuits, etc.) to use a model to process input data to generate an output based on patterns and/or associations previously learned by the model via a training process. For instance, the model may be trained with data to recognize patterns and/or associations and follow such patterns and/or associations when processing input data such that other input(s) result in output(s) consistent with the recognized patterns and/or associations.
Currently, the number of neurons or units in a Neural Network (NN) layer is manually defined by the network architect. Usually, this hyper parameter is chosen based on the experience of the architect and a trial-and-error process. Therefore, the final NN topologies are commonly considered to be somewhat suboptimal.
The typical procedure is for the network architect to define the number of neurons manually, and to use a trial-and-error process with multiple trainings, until the user obtains a satisfactory balance between the number of neurons and the expected accuracy performance. Some existing techniques try to automatically tune the hyper parameters, performing multiple trainings through brute force, until the combination of hyper-parameters that yields the best performance is found.
Another technique is to use genetic algorithms. In this method, every model represents an element of the population, new models are generated by combining models from previous generations, and a new training is performed for each combination in order to identify the best topologies in the new population.
Any such existing technique uses multiple training iterations, which translates into high computational power expenditure over extended periods of time. Therefore, these methods suffer from low overall practical efficiency.
Example approaches disclosed herein utilize a second-order method to minimize the global loss error in NN training, using fully connected layers, based on the usage of vertical and horizontal tangent parabolas. Example approaches disclosed herein expand the search area of zero-crossings in the error derivative function, quantifying whether more or fewer neurons are needed in a fully connected layer in order to optimally classify the patterns in the training database.
In example approaches disclosed herein, the number of neurons converges to the number of roots of the derivative of the error function. As used herein, a root is defined as a numeric value corresponding to a zero crossing in a derivative of a function. In some examples, a function may have any number of roots. For example, a simple quadratic function may have a single root, whereas a more complex polynomial function might have multiple roots. When two neurons converge to the same root, they merge into a single neuron. Additionally, in each iteration, every neuron either improves its position to better cover the training data distribution or splits into two neurons, depending on its derivative function.
Examples disclosed herein seek to minimize error of a NN model by relocating neurons in the fully connected layer to the roots of the derivative of the error (i.e., E′(u)=dE(u)/du). A local minimum of E is a root of E′. Such an approach does not require an initial definition of a search interval for the roots. Moreover, example approaches disclosed herein move neurons that are not in the minima neighborhood, reducing computational costs and therefore improving the model architecture. As a result, not only are weighting parameters of the NN trained, but at the same time the topology of the NN is improved, without the cost of having to train multiple topologies.
FIG. ID15_1 includes graphs illustrating iterations of the example process for finding roots of a function. The illustrated example of FIG. ID15_1 includes six graphs ID15_110, ID15_120, ID15_130, ID15_140, ID15_150, ID15_160. In each of the graphs ID15_110, ID15_120, ID15_130, ID15_140, ID15_150, ID15_160, a polynomial function ID15_105 is shown. In the illustrated example of FIG. ID15_1, the polynomial function ID15_105 can be represented by Equation ID15_1:
$f(x) = 5x^{5} + 2x^{4} - 15x^{3} + 6x + 1$    Equation ID15_1
Throughout the six graphs/iterations ID15_110, ID15_120, ID15_130, ID15_140, ID15_150, ID15_160, roots of the polynomial function ID15_105 are found (e.g., represented as x1, x2, x3, x4, x5, x6). In the illustrated example of FIG. ID15_1, second order polynomial functions are added at tangent points of the polynomial function ID15_105. In an initial iteration, a starting point (e.g., x0) is randomly selected. In the illustrated example of FIG. ID15_1, the polynomials used at x0 can be used to generate x1 (a root) and x2. In this case, x2 is not a root, but it helps to generate x3, which in the next iteration yields another root and also x4; x4 in turn helps to find the root x6, and so on.
FIG. ID15_2 represents values of seven iterations of the root finding process associated with the polynomial function ID15_105 of FIG. ID15_1. In the illustrated example of FIG. ID15_2, seven iterations are illustrated representing a progression through the process of identifying roots. In the illustrated example of FIG. ID15_2, a checkmark next to a value indicates that the value was identified as a root. Thus, in the illustrated example of FIG. ID15_2, the five roots of the polynomial function ID15_105 of FIG. ID15_1 were identified in seven iterations.
Contrast this identification of the roots of the polynomial with a brute force searching technique, which might require hundreds or thousands of iterations. Moreover, common brute forcing techniques apply ‘binning’ where roots are searched for within particular zones of the polynomial (e.g., between integer values). Such techniques are typically limited to identification of a single root within each zone. Thus, if in the context of the polynomial function ID15_105 of FIG. ID15_1, roots were searched for between −2 and −1, −1 and 0, 0 and 1, and 1 and 2, only four roots would have been identified, when there are in fact five roots.
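As an illustrative, non-limiting sketch, the following Python snippet counts sign changes of the polynomial of Equation ID15_1 on a fine grid inside each integer zone. The helper name `sign_changes_in_bin` and the grid resolution of 100 steps per zone are assumptions made for illustration only. The count suggests that one of the zones holds two roots, which is why a technique limited to one root per zone would miss a root.

```python
def f(x):
    # Polynomial of Equation ID15_1, evaluated in Horner form.
    return ((((5.0 * x + 2.0) * x - 15.0) * x) * x + 6.0) * x + 1.0


def sign_changes_in_bin(lo, hi, steps=100):
    """Count sign changes of f on a uniform grid over [lo, hi]."""
    count = 0
    prev = f(lo)
    for i in range(1, steps + 1):
        cur = f(lo + (hi - lo) * i / steps)
        if prev * cur < 0:
            count += 1
        prev = cur
    return count


bins = [(-2, -1), (-1, 0), (0, 1), (1, 2)]
roots_per_bin = [sign_changes_in_bin(lo, hi) for lo, hi in bins]
print(roots_per_bin)  # [1, 2, 1, 1]
```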
FIG. ID15_3 is a block diagram of an example neural network. The example neural network includes a radial basis function (RBF) based fully connected layer ID15_310. In the illustrated example of FIG. ID15_3, an RBF NN is used, however any other type of network may additionally or alternatively be used. The illustrated example of FIG. ID15_3 shows the basic architecture in which the number of neurons in the hidden layer is k. Example approaches disclosed herein will increase and decrease the number of neurons depending on the number of roots detected in the derivative of the error function. In the illustrated example of FIG. ID15_3, μ represents a centroid of the hyper-spheroid (a Gaussian for 2D), w represents the weights of the output layer, a represents the outputs from the Gaussian neurons and o is the output of the final layer.
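As an illustrative sketch of the architecture of FIG. ID15_3, the following Python snippet computes the Gaussian activations a and the output o for one input. It assumes a unit-width Gaussian of the squared distance to each centroid μ and a linear output layer with weights w. Those forms, the function names, and the numeric values are assumptions for illustration and are not taken from the disclosure.

```python
import math


def rbf_activations(x, centroids):
    """Gaussian activation of each hidden neuron for one input vector x."""
    return [
        math.exp(-sum((xl - ml) ** 2 for xl, ml in zip(x, mu)))
        for mu in centroids
    ]


def rbf_output(x, centroids, weights):
    """Output o as a weighted sum of the hidden activations a (assumed linear)."""
    a = rbf_activations(x, centroids)
    return sum(w * aj for w, aj in zip(weights, a))


# k = 2 hidden neurons with hypothetical centroids and weights.
centroids = [[0.0, 0.0], [1.0, 1.0]]
weights = [2.0, 3.0]
o = rbf_output([0.0, 0.0], centroids, weights)
print(o)
```

An input that coincides with a centroid produces an activation of 1 for that neuron, and the activation decays as the input moves away from the centroid.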
During training of the neural network, the total error is given by:
Where oi is given by:
And aij is defined as:
a_ij = e^(−Σ(x − μ)^2)
Then, the error E partial derivative is:
Since we define our function as the derivative of the error:
Then, the derivative is denoted as:
And:
Where S is:
And:
V=[2ƒ′(ujl),−2ƒ(ujl),−ƒ′(ujl),−ƒ(ujl)]
Now we can compute the ratios:
The above ratios allow the example root finder ID15_410 of FIG. ID15_4 to implement the process described below in connection with
FIG. ID15_4 is a block diagram of an example machine learning trainer ID15_400 implemented in accordance with teachings of this disclosure to perform second order bifurcating for training of a machine learning model. The example machine learning trainer ID15_400 includes a training data datastore ID15_405, a root finder ID15_410, a root tester ID15_412, a model data datastore ID15_415, and a model provider ID15_420.
The training data datastore ID15_405 of the illustrated example of FIG. ID15_4 is implemented by any storage device (e.g., memory, structure, and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, solid state memory, hard drive(s), thumb drive(s), etc.). Furthermore, the data stored in the example training data datastore ID15_405 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc. While, in the illustrated example, the training data datastore ID15_405 is illustrated as a single device, the example training data datastore ID15_405 and/or any other data storage devices described herein may be implemented by any number and/or type(s) of memories. The example training data datastore ID15_405 stores data that is used by the example root finder ID15_410 to train a model.
The example root finder ID15_410 of the illustrated example of FIG. ID15_4 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), Coarse Grained Reduced precision architecture (CGRA(s)), image signal processor(s) (ISP(s)), etc. The example root finder ID15_410 of the illustrated example of FIG. ID15_4 uses second order bifurcation to attempt to identify potential values of roots to be tested by the root tester ID15_412. As used herein, the roots of the function represent neurons of a neural network. In some examples, the root finder ID15_410 uses the identified roots to train a machine learning model. The example root finder ID15_410 stores the results of the machine learning model training in the example model data datastore ID15_415.
The example root tester ID15_412 of the illustrated example of FIG. ID15_4 is implemented by a logic circuit such as, for example, a hardware processor. However, any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), Coarse Grained Reduced precision architecture (CGRA(s)), image signal processor(s) (ISP(s)), etc. The example root tester ID15_412 of the illustrated example of FIG. ID15_4 evaluates error values to determine whether a particular value (e.g., one or more input values to a function) identifies a root of a function.
The example model data datastore ID15_415 of the illustrated example of FIG. ID15_4 is implemented by any storage device (e.g., memory, structure, and/or storage disc for storing data such as, for example, flash memory, magnetic media, optical media, solid state memory, hard drive(s), thumb drive(s), etc.). Furthermore, the data stored in the example model data datastore ID15_415 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc. While, in the illustrated example, the model data datastore ID15_415 is illustrated as a single device, the example model data datastore ID15_415 and/or any other data storage devices described herein may be implemented by any number and/or type(s) of memories. The example model data datastore ID15_415 stores information concerning a model trained by the root finder ID15_410. Such information may include, for example, model hyperparameters, information concerning the architecture of the model, etc.
The example model provider ID15_420 of the illustrated example enables the example machine learning trainer ID15_400 to transmit and/or otherwise provide access to a model stored in the model data datastore ID15_415. In this manner, the model may be trained at the machine learning trainer ID15_400 (e.g., a first device), and be provided to another device (e.g., a second device) by the model provider ID15_420 via, for example, a network (e.g., the Internet) to allow the other device to utilize the model for inference. In other words, the example machine learning trainer ID15_400 may be implemented to perform training at an edge device, an IoT device, a cloud server, or any other computing device capable of training a machine learning model.
While an example manner of implementing the machine learning trainer ID15_400 is illustrated in FIG. ID15_4, one or more of the elements, processes and/or devices illustrated in FIG. ID15_4 may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example root finder ID15_410, the example root tester ID15_412, the example model provider ID15_420, and/or, more generally, the example machine learning trainer ID15_400 of FIG. ID15_4 may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Thus, for example, any of the example root finder ID15_410, the example root tester ID15_412, the example model provider ID15_420, and/or, more generally, the example machine learning trainer ID15_400 of FIG. ID15_4 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), programmable controller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)) and/or field programmable logic device(s) (FPLD(s)). When reading any of the apparatus or system claims of this patent to cover a purely software and/or firmware implementation, at least one of the example root finder ID15_410, the example root tester ID15_412, the example model provider ID15_420, and/or, more generally, the example machine learning trainer ID15_400 of FIG. ID15_4 is/are hereby expressly defined to include a non-transitory computer readable storage device or storage disk such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc. including the software and/or firmware. Further still, the example machine learning trainer ID15_400 of FIG. ID15_4 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIG. ID15_4, and/or may include more than one of any or all of the illustrated elements, processes and devices.
As used herein, the phrase “in communication,” including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.
Flowcharts representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the machine learning trainer ID15_400 of FIG. ID15_4 are shown in FIGS. ID15_5, ID15_6, ID15_7, and/or ID15_8. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor D104 shown in the example compute node D100 discussed below in connection with
The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data or a data structure (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers) located at the same or different locations of a network or collection of networks (e.g., in the cloud, in edge devices, etc.). The machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement one or more functions that may together form a program such as that described herein.
In another example, the machine readable instructions may be stored in a state in which they may be read by processor circuitry, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device. In another example, the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, machine readable media, as used herein, may include machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
As mentioned above, the example process(es) of FIGS. ID15_6, ID15_5, ID15_7, and/or ID15_8 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
FIG. ID15_5 is a flowchart representing example machine-readable instructions that may be executed to train a machine learning model using the roots identified by the root finder of FIG. ID15_4. The example process of FIG. ID15_5 begins when the root finder ID15_410 accesses training data for training of a machine learning model. (Block ID15_510). In examples disclosed herein, the training data is stored in the training data datastore ID15_405. In some examples, the training data is stored in the training data datastore ID15_405 in response to receipt of the training data from another node (e.g., another node in the edge computing network). The example root finder ID15_410 and/or the example root tester ID15_412 discover roots as a function of the training data, as described in the root finding process of FIGS. ID15_6, ID15_7, and/or ID15_8, below. (Block ID15_520). The example root finder ID15_410 determines whether root discovery is to continue. (Block ID15_530). Such a determination may be made based on, for example, whether a threshold number of roots have been found, whether a threshold amount of time has been spent attempting to identify roots, whether a threshold number of iterations of the root discovery process have been executed, and/or any other features or combination(s) thereof. If root discovery is to continue (e.g., block ID15_530 returns a result of YES), control proceeds to block ID15_520 where execution of the root discovery process is continued.
If root discovery is complete (e.g., block ID15_530 returns a result of NO), the example root finder ID15_410 utilizes the discovered root(s) as neurons in the training of a machine learning model. (Block ID15_540). In this manner, additional extraneous neurons that would have otherwise been included in the trained machine learning model can be avoided by use of the discovered roots. The example root finder ID15_410 then stores the trained machine learning model in the model data datastore ID15_415. (Block ID15_550).
In the illustrated example of FIG. ID15_5, the example model provider ID15_420 provides the stored machine learning model to other device(s) for execution. (Block ID15_560). In this manner, the model can be trained at a first location (e.g., a central node in an edge network), and be distributed to a second location(s) (e.g., edge nodes in the edge network) so that the machine learning model can be executed at those second location(s). However, in some examples, the model is trained and executed at the same node. Utilizing the training process based on root discovery disclosed herein enables more efficient discovery of nodes than previous techniques, thereby allowing training to potentially also be carried out at nodes with less compute resources (e.g., less memory, less processor resources, etc.). Such an approach enables models to be trained (e.g., re-trained, updated, etc.) more frequently. The example process of FIG. ID15_5 then terminates, but may be re-executed upon, for example, an indication that training is to occur, an indication that new training data is present, errors being detected in the trained model, a threshold amount of time elapsing since a prior training, etc.
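As an illustrative sketch, the continue/stop decision of block ID15_530 might be expressed as a check against configured thresholds. The class name `DiscoveryLimits`, the function name `discovery_should_continue`, and all default values below are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class DiscoveryLimits:
    # Illustrative thresholds; the disclosure does not specify values.
    max_roots: int = 10
    max_iterations: int = 1000
    max_seconds: float = 5.0


def discovery_should_continue(limits, roots_found, iterations, started_at, now=None):
    """Return True while none of the configured thresholds has been reached."""
    now = time.monotonic() if now is None else now
    if roots_found >= limits.max_roots:
        return False          # threshold number of roots found
    if iterations >= limits.max_iterations:
        return False          # threshold number of iterations executed
    if now - started_at >= limits.max_seconds:
        return False          # threshold amount of time spent
    return True


limits = DiscoveryLimits(max_roots=5, max_iterations=100, max_seconds=2.0)
print(discovery_should_continue(limits, 3, 10, started_at=0.0, now=1.0))
```

Passing `now` explicitly keeps the check deterministic for testing; in use, the monotonic clock is read on each call.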
FIG. ID15_6 is a flowchart representing example machine-readable instructions that may be executed to cause the example root finder and/or root tester to find roots of a function. The illustrated example of FIG. ID15_6 represents example operations performed by the machine learning trainer ID15_400 of FIG. ID15_4. In examples disclosed herein, the process illustrated in FIG. ID15_6 is multithreaded, and uses at least an initial value for every weight to start the exploration of the derivative of the NN model error in order to find the roots of the function, and to move a neuron weight at each local minimum.
The example process ID15_600 of FIG. ID15_6 begins when the example root finder ID15_410 selects a starting point for identification of roots of a function. (Block ID15_605). In examples disclosed herein, the starting point is initialized to a random value. However, any other value may additionally or alternatively be used. The example root finder ID15_410 identifies a potential value of a root that is to be analyzed (e.g., a value is selected for a determination of whether it is a root). (Block ID15_610). In an initial iteration, the potential value to be analyzed is the starting point selected at block ID15_605. In subsequent iterations, the potential value(s) to be analyzed may be based on stored values resulting from a prior iteration of analysis (e.g., as described in connection with block ID15_630, below). The root tester ID15_412 determines if a value of the function is less than an error threshold. (Block ID15_615). If so (e.g., block ID15_615 returns a result of YES), the example root tester ID15_412 adds the identified tested value as a root. (Block ID15_620). The example root tester ID15_412 determines whether additional roots should be searched and/or tested. (Block ID15_622). If the search for additional roots is to continue (e.g., block ID15_622 returns a result of YES), control proceeds to block ID15_610. If no additional searching is to be performed (e.g., block ID15_622 returns a result of NO), the process ID15_600 terminates (e.g., control returns to block ID15_520 of FIG. ID15_5).
Returning to block ID15_615, if block ID15_615 returns a result of NO, the example root finder ID15_410 uses second order bifurcation to attempt to identify one or more potential locations of roots to be analyzed in a subsequent iteration. (Block ID15_630). In some examples, the root finder ID15_410 considers whether a potential value had previously been analyzed and, if so, avoids re-analysis of the potential value. The example process iteratively continues until the root discovery process is complete.
FIG. ID15_7 is a flowchart representing example machine-readable instructions that may be executed to cause the example root finder ID15_410 to find roots of a function. The illustrated example of FIG. ID15_7 represents example operations performed by the machine learning trainer ID15_400 of FIG. ID15_4. FIG. ID15_8 is a flowchart representing example mathematical operations corresponding to the machine-readable instructions of FIG. ID15_7. The illustrated example of FIG. ID15_8 illustrates mathematical representations ID15_800 of those operations of FIG. ID15_7. In examples disclosed herein, the process illustrated in FIG. ID15_7 is multithreaded, and uses at least an initial value for every weight to start the exploration of the derivative of the NN model error in order to find the roots of the function, and to move a neuron weight at each local minimum. In general, the process is based on the derivative ratios r1, r2, r3 generated by the derivatives of the function ƒ(x), defined as:
r1 = ƒ(a)/ƒ′(a), r2 = ƒ(a)/ƒ″(a), r3 = ƒ′(a)/ƒ″(a)
Where ƒ(a) represents the derivative of the NN model error. Depending on the values of ƒ′(a) and ƒ″(a) there are four different scenarios as described in the flowchart of FIG. ID15_7.
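One way to read these ratios (an interpretation offered for illustration, not a statement from the disclosure) is through the second-order Taylor polynomial p(x) of ƒ about the tested point a. Setting p(x) = 0 yields the two-point update used at block ID15_785, and the case ƒ″(a) = 0 with ƒ′(a) ≠ 0 reduces to the single-point update of block ID15_740:

```latex
\begin{aligned}
p(x) &= f(a) + f'(a)\,(x-a) + \tfrac{1}{2}\,f''(a)\,(x-a)^{2} \\[4pt]
p(x) = 0,\; f''(a) \neq 0 \;&\Longrightarrow\;
x_{1,2} = a - \frac{f'(a)}{f''(a)}
  \pm \sqrt{\left(\frac{f'(a)}{f''(a)}\right)^{2} - \frac{2\,f(a)}{f''(a)}}
  = a - r_{3} \pm \sqrt{r_{3}^{2} - 2\,r_{2}} \\[4pt]
f''(a) = 0,\; f'(a) \neq 0 \;&\Longrightarrow\;
x_{1} = a - \frac{f(a)}{f'(a)} = a - r_{1}
\end{aligned}
```

When r3^2 − 2r2 is negative, the parabola p(x) has no real root, and a − r3 is the abscissa of its vertex, which matches the single point set at block ID15_780.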
The example process ID15_700 of FIG. ID15_7 begins when the example root finder ID15_410 selects a starting point for identification of roots of a function ƒ(a). (Block ID15_705). An example equation for implementing block ID15_705 of FIG. ID15_7 is shown in block ID15_805 of FIG. ID15_8. In examples disclosed herein, the starting point (x0) is initialized to a random value. However, any other value may additionally or alternatively be used. The example root finder ID15_410 identifies a potential value of a root that is to be analyzed (e.g., a value is selected for a determination of whether it is a root). (Block ID15_710). An example equation for implementing block ID15_710 of FIG. ID15_7 is shown in block ID15_810 of FIG. ID15_8.
In an initial iteration, the potential value to be analyzed is the starting point selected at block ID15_705. In subsequent iterations, the potential value(s) to be analyzed may be based on stored values resulting from a prior iteration of analysis (e.g., as described in connection with block ID15_750, below). The root tester ID15_412 determines if a value of the function is less than an error threshold. (Block ID15_715). An example equation for implementing block ID15_715 of FIG. ID15_7 is shown in block ID15_815 of FIG. ID15_8.
If the value of the function is less than the error threshold (e.g., block ID15_715 returns a result of YES), the example root finder ID15_410 adds the identified value of x0 as a root. (Block ID15_720). The example root finder ID15_410 determines whether additional roots should be searched. (Block ID15_722). If the search for additional roots is to continue (e.g., block ID15_722 returns a result of YES), control proceeds to block ID15_710. If no additional searching is to be performed (e.g., block ID15_722 returns a result of NO), the process ID15_700 terminates.
Returning to block ID15_715, if block ID15_715 returns a result of NO, the example root finder ID15_410 determines whether the second derivative of the function (e.g., ƒ″(a)) equals zero. (Block ID15_725). An example equation for implementing block ID15_725 of FIG. ID15_7 is shown in block ID15_825 of FIG. ID15_8.
If the second derivative of the function equals zero (Block ID15_725 returns a result of YES), the example root finder ID15_410 determines whether the first derivative (e.g., ƒ′(a)) equals zero. (Block ID15_730). An example equation for implementing block ID15_730 of FIG. ID15_7 is shown in block ID15_830 of FIG. ID15_8.
If the first derivative does not equal zero (block ID15_730 returns a result of NO), the example root finder ID15_410 calculates the first derivative ratio (e.g., r1) (Block ID15_735), and sets a first point (x1) equal to the index (a) minus the first derivative ratio (e.g., r1). (Block ID15_740). Example equations for implementing blocks ID15_735 and ID15_740 of FIG. ID15_7 are shown in blocks ID15_835 and ID15_840 of FIG. ID15_8, respectively.
If the first derivative equals zero (e.g., Block ID15_730 returns a result of SINGULAR or YES), the example root finder ID15_410 sets a first point and a second point (x1,2) equal to a±1. (Block ID15_745). An example equation for implementing block ID15_745 of FIG. ID15_7 is shown in block ID15_845 of FIG. ID15_8. Control then proceeds from blocks ID15_740 or ID15_745 to block ID15_750, where the example root finder ID15_410 stores the first point and/or the second point, thereby identifying subsequent points to be searched for roots. (Block ID15_750). In some examples, the first point is stored as L[n], and the second point (if set) is stored as L[n+1]. An example equation for implementing block ID15_750 of FIG. ID15_7 is shown in block ID15_850 of FIG. ID15_8.
Control then proceeds to block ID15_710, where the process is repeated. In such an example, multiple additional threads may be created as part of the subsequent searches. For example, if both a first and second point were identified (e.g., both L[n] and L[n+1] were stored), two additional threads might be created to facilitate the searching of the roots at those locations. In some alternative examples, the existing thread may be re-used for one of the additional searches, and a second (new) thread may be created. The subsequently searched values and/or the threads in which those subsequently searched values are searched may result in further locations to analyze to determine if a root has been found.
In some examples, when identifying a value to be analyzed to determine if the value is a root, the example root finder ID15_410 considers whether potential values had previously been tested. If a potential value (or a value within a threshold distance of the potential value) had previously been tested, not repeating the test of that value avoids the possibility of infinite loops being created. For example, in the illustrated example of FIG. ID15_2, an initial value of −1.9 does not find a root, but results in potential values for roots at −1.849 and −1.41, which are analyzed in a first subsequent iteration. In the first subsequent iteration, the value −1.849 is confirmed as a root, and the value −1.41 is identified as not being a root. The testing of the value −1.41 results in potential values for roots at −2.02 and −0.93, which are analyzed in a second subsequent iteration. In the second subsequent iteration, the value −2.02 does not result in identification of a root, but does result in potential values for roots at −1.849 and 1.56. The value 1.56 is analyzed in a third subsequent iteration. However, as the value −1.849 was previously analyzed (regardless of whether it was identified as a root), the value −1.849 is not included in the third subsequent iteration. In this manner, previously analyzed values of potential roots do not result in additional threads being created.
Moreover, in some examples, had the root finder ID15_410, after analysis of the value −2.02 in the second subsequent iteration, determined that the value −1.848 (as opposed to −1.849) should be tested, the root finder ID15_410 may determine that the value −1.848 should not be tested, as it is within a threshold distance from a previously tested value. In some examples, using a different (e.g., larger) threshold distance may reduce computing time (and, as an extension, compute resource requirements of the machine learning trainer ID15_400), as additional similar values are not analyzed, at the expense of potentially missing identifications of roots that are close to each other. In contrast, using a smaller threshold distance may increase computing time (and, as an extension, compute resource requirements of the machine learning trainer ID15_400), as additional similar values are analyzed, thereby potentially identifying additional roots that might not have otherwise been discovered (resulting in a more accurate machine learning model than if those roots had not been discovered).
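As an illustrative sketch of the threshold-distance check described above, the following Python snippet decides whether a candidate value should be analyzed given the values already tested. The function name `is_new_candidate` and the two threshold values are hypothetical.

```python
def is_new_candidate(candidate, tested, min_distance):
    """True if no previously tested value lies within min_distance of candidate."""
    return all(abs(candidate - t) >= min_distance for t in tested)


# Values analyzed so far in the example discussed above.
tested = [-1.9, -1.849, -1.41, -2.02]

# With a larger threshold, -1.848 is treated as already analyzed.
print(is_new_candidate(-1.848, tested, min_distance=0.01))   # False
# With a smaller threshold, -1.848 would be analyzed as a new value.
print(is_new_candidate(-1.848, tested, min_distance=1e-4))   # True
```

The threshold trades compute time against the chance of skipping a root that lies close to a previously tested value.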
Returning to block ID15_725 of FIG. ID15_7, if the second derivative (e.g., ƒ″(a)) is not equal to zero (e.g., Block ID15_725 returns a result of NO), the example root finder ID15_410 calculates a third derivative ratio (r3). (Block ID15_755). An example equation for implementing block ID15_755 of FIG. ID15_7 is shown in block ID15_855 of FIG. ID15_8. The example root finder ID15_410 determines if either the first derivative of the function (ƒ′(a)) or the second derivative of the function (ƒ″(a)) is less than the error threshold. (Block ID15_760). An example equation for implementing block ID15_760 of FIG. ID15_7 is shown in block ID15_860 of FIG. ID15_8.
If block ID15_760 returns a result of YES, the example root finder ID15_410 sets the first point (x1) based on the tested point (a) and half of the third derivative ratio (e.g., equal to a+(½) r3). (Block ID15_765). An example equation for implementing block ID15_765 of FIG. ID15_7 is shown in block ID15_865 of FIG. ID15_8. Control then proceeds to block ID15_750.
If block ID15_760 returns a result of NO, the example root finder ID15_410 calculates a second derivative ratio (r2). (Block ID15_770). The example root finder ID15_410 determines whether the square of the third derivative ratio (e.g., r3 squared) is greater than or equal to twice the second derivative ratio (e.g., 2r2). (Block ID15_775). An example equation for implementing block ID15_775 of FIG. ID15_7 is shown in block ID15_875 of FIG. ID15_8. If block ID15_775 evaluates to false (e.g., Block ID15_775 returns a result of NO), the example root finder ID15_410 sets the first point (x1) based on the tested point (a) and the third derivative ratio (e.g., equal to a−r3). (Block ID15_780). An example equation for implementing block ID15_780 of FIG. ID15_7 is shown in block ID15_880 of FIG. ID15_8. Otherwise (e.g., if Block ID15_775 returns a result of YES), the root finder ID15_410 calculates the first and second points using the equation based on the tested point, the second derivative ratio, and the third derivative ratio (e.g., x1,2 = a − r3 ± √(r3^2 − 2r2)). (Block ID15_785). An example equation for implementing block ID15_785 of FIG. ID15_7 is shown in block ID15_885 of FIG. ID15_8. Control then proceeds to block ID15_750. As noted above, at block ID15_750, the example root finder ID15_410 stores the first point x1 as L[n], and the second point (if set) x2 as L[n+1], thereby identifying subsequent points to be searched for roots. (Block ID15_750). Control then proceeds to block ID15_710, where the process is repeated. In such an example, multiple additional threads may be created as part of the subsequent searches. For example, if both L[n] and L[n+1] were stored, two additional threads might be created to facilitate the searching of the roots at those locations. In some alternative examples, the existing thread may be re-used for one of the additional searches, and a second (new) thread may be created.
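As an illustrative, non-limiting sketch, the branch structure of blocks ID15_725 through ID15_785 can be written as a single function that returns the next point(s) to be stored at block ID15_750. The function name `next_candidates`, the parameter `eps`, and the use of absolute values in the block ID15_760 comparison are assumptions made for illustration. The usage example applies the function to the polynomial of Equation ID15_1 at the tested point −1.9.

```python
import math


def f(x):
    return 5 * x**5 + 2 * x**4 - 15 * x**3 + 6 * x + 1   # Equation ID15_1


def df(x):
    return 25 * x**4 + 8 * x**3 - 45 * x**2 + 6           # first derivative


def d2f(x):
    return 100 * x**3 + 24 * x**2 - 90 * x                # second derivative


def next_candidates(a, f, df, d2f, eps=1e-9):
    """Return the next point(s) to test from the tested point a."""
    f0, f1, f2 = f(a), df(a), d2f(a)
    if f2 == 0:                                   # block ID15_725
        if f1 == 0:                               # block ID15_730
            return [a - 1.0, a + 1.0]             # block ID15_745: a +/- 1
        return [a - f0 / f1]                      # blocks ID15_735/740: a - r1
    r3 = f1 / f2                                  # block ID15_755
    if abs(f1) < eps or abs(f2) < eps:            # block ID15_760 (abs assumed)
        return [a + 0.5 * r3]                     # block ID15_765
    r2 = f0 / f2                                  # block ID15_770
    disc = r3 * r3 - 2.0 * r2                     # block ID15_775
    if disc < 0:
        return [a - r3]                           # block ID15_780
    s = math.sqrt(disc)
    return [a - r3 - s, a - r3 + s]               # block ID15_785


candidates = next_candidates(-1.9, f, df, d2f)
print(candidates)  # two points, close to -1.849 and -1.41
```

Under these assumptions, the tested point −1.9 yields two candidate points that agree with the values −1.849 and −1.41 discussed above to within rounding.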
The example process ID15_700 of FIG. ID15_7 continues until the root discovery process is complete. The example process may be considered complete when, for example, all threads have resulted in identification of a root and no additional locations exist for testing to determine if the location represents a root. However, any other approach to determining when the root discovery process is considered complete may additionally or alternatively be used including, for example, when a threshold number of roots have been discovered (e.g., five roots, ten roots, twenty roots), when a threshold number of potential root locations have been tested (e.g., twenty potential root locations tested, one hundred potential root locations tested, one thousand potential root locations tested, etc.), a threshold amount of time has elapsed during the root discovery process, a threshold amount of compute resources have been used in the root discovery process, etc.
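As an illustrative sketch of the overall loop, the following single-threaded Python snippet processes stored points one at a time instead of in separate threads. The names `step` and `find_roots`, the error threshold, the threshold distance, the merge tolerance, the magnitude bound, and the stopping limits are all assumptions chosen for illustration. The sketch does not guarantee that every root is found from a given starting point.

```python
import math
from collections import deque


# Polynomial of Equation ID15_1 and its derivatives, in Horner form.
def f(x):
    return ((((5.0 * x + 2.0) * x - 15.0) * x) * x + 6.0) * x + 1.0


def df(x):
    return ((25.0 * x + 8.0) * x - 45.0) * x * x + 6.0


def d2f(x):
    return ((100.0 * x + 24.0) * x - 90.0) * x


def step(a, eps):
    """Next point(s) from tested point a (blocks ID15_725 to ID15_785)."""
    f0, f1, f2 = f(a), df(a), d2f(a)
    if f2 == 0:
        return [a - 1.0, a + 1.0] if f1 == 0 else [a - f0 / f1]
    r3 = f1 / f2
    if abs(f1) < eps or abs(f2) < eps:   # abs is an assumption
        return [a + 0.5 * r3]
    disc = r3 * r3 - 2.0 * f0 / f2       # r3^2 - 2*r2
    if disc < 0:
        return [a - r3]
    s = math.sqrt(disc)
    return [a - r3 - s, a - r3 + s]


def find_roots(start, eps=1e-6, min_distance=1e-9, merge_tol=1e-4,
               max_tested=500, max_roots=10, bound=1e6):
    pending = deque([start])             # plays the role of the stored list L
    tested, roots = [], []
    while pending and len(tested) < max_tested and len(roots) < max_roots:
        a = pending.popleft()
        if any(abs(a - t) < min_distance for t in tested):
            continue                     # previously analyzed value
        tested.append(a)
        if abs(f(a)) < eps:              # root test (block ID15_715)
            if all(abs(a - r) > merge_tol for r in roots):
                roots.append(a)          # keep one value per distinct root
            continue
        # Store finite, bounded points for later analysis (block ID15_750).
        pending.extend(x for x in step(a, eps)
                       if math.isfinite(x) and abs(x) <= bound)
    return sorted(roots)


roots = find_roots(-1.9)
print(roots)
```

The loop stops when no stored points remain, when the limit on tested points is reached, or when the limit on discovered roots is reached, which corresponds to three of the completion criteria listed above.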
In further examples, any of the compute nodes or devices discussed with reference to the present edge computing systems and environment (e.g., the machine learning trainer ID15_400 of FIG. ID15_4) may be fulfilled based on the components depicted in
FIG. ID15_10 represents an experiment to train a neural network to generate a particular output. The example images illustrated in
In the illustrated example of FIG. ID15_10, the NN is trained to classify between the red dots and the black dots (see plots below). White circles represent each 2D RBF neuron distribution. As can be seen in the sequence of plots of FIG. ID15_10 including, in order, a first plot ID15_1005, a second plot ID15_1010, a third plot ID15_1015, a fourth plot ID15_1020, and a fifth plot ID15_1025, the number of neurons decreases while the NN is being trained and the error decreases, based on merging the neurons on the same E′ root. A final plot ID15_1030 (black and yellow) shows the evaluation of the network pixel by pixel, illustrating a final optimized number of resources (9 neurons) for this problem.
From the foregoing, it will be appreciated that example methods, apparatus and articles of manufacture have been disclosed that enable training of a neural network by not only adjusting weights of the neural network, but also automatically adjusting the number of neurons inside a fully connected layer during the training process by introducing quadratic functions (i.e., the tangent vertical and horizontal parabolas).
By implementing this methodology, it is possible to converge faster and more robustly to the roots of the derivative of the error function. As a result, training of a neural network can be accomplished more efficiently on lower power devices, such as edge computing nodes, as opposed to high-powered centralized servers.
At each one of the roots found, a neuron is located, and when two neurons converge into the same root, they are merged into a single neuron. This constitutes the base mechanism to optimize the number of neurons needed by the final NN topology. In the same fashion, if the algorithm detects the presence of a larger number of roots, it is possible to generate new neurons by splitting the existing ones in the layer (similar to the biological mitosis process). These two processes ensure an appropriate number of neurons will be obtained at the end of the training stage. Such an approach typically reduces the number of neurons needed for implementing a neural network. As a result, smaller neural network models are created. Using smaller neural network models enables inference using those models on lower power computing devices (e.g., edge nodes) and reduces the communications requirements for providing those models to those lower power computing devices. As a result, examples disclosed herein help to find improved topologies of a NN model in order to save computing resources during training, and provide an expected performance when deployed for inference.
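The merge step described above can be illustrated with a short sketch. It assumes, purely for illustration, that neuron locations are one-dimensional values and that two neurons count as having converged to the same root when they lie within a caller-supplied tolerance; the function name merge_converged and the tolerance parameter are hypothetical.

```python
def merge_converged(locations, tolerance):
    """Keep one neuron location per root.

    Locations within `tolerance` of an already kept location are treated
    as having converged to the same root and are merged into it.
    """
    merged = []
    for x in sorted(locations):
        if merged and abs(x - merged[-1]) <= tolerance:
            continue  # same root as the previously kept neuron
        merged.append(x)
    return merged


neurons = merge_converged([0.0, 0.001, 1.0, 1.0005, 2.0], tolerance=0.01)
```

In this sketch the first location of each group is kept as the representative; averaging the merged locations would be an equally plausible choice.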
Although certain example methods, apparatus and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims of this patent.
Example methods, apparatus, systems, and articles of manufacture to optimize resources in Edge networks are disclosed herein. Further examples and combinations thereof include the following:
Example 165 includes an apparatus to train a machine learning model, the apparatus including memory, instructions, and at least one processor to execute the instructions to cause the at least one processor to at least access training data for the training of the machine learning model, iterate over possible locations of roots using second order bifurcation to determine whether respective ones of the possible locations of the roots are roots of the training data, and create the machine learning model based on the respective ones of the possible locations of the roots that are determined to be roots of the training data.
Example 166 includes the apparatus of example 165, wherein the training data is represented by a function, and to determine whether a first location of the possible locations of the roots is a root, the at least one processor is to execute the instructions to cause the at least one processor to determine if a value of the function at the first location is less than an error threshold, and in response to determining that the value of the function is less than the error threshold, record an indication that the first location is a first root.
Example 167 includes the apparatus of example 166, wherein the at least one processor is to execute the instructions to cause the at least one processor to, in response to the determination that the value of the function is not less than the error threshold, identify a second location, the second location to be used in a subsequent iteration to determine whether the second location is a second root.
Example 168 includes the apparatus of example 167, wherein the at least one processor is to execute the instructions to cause the at least one processor to not use the second location in the subsequent iteration if the second location is within a threshold distance of the first root.
Example 169 includes the apparatus of example 167, wherein the at least one processor is to execute the instructions to cause the at least one processor to, in response to the determination that the value of the function is not less than the error threshold, identify a third location, the third location to be used in the subsequent iteration to determine whether the third location is a third root.
Example 170 includes the apparatus of any one of examples 167-169, wherein the at least one processor is to execute the instructions to cause the at least one processor to identify the second location based on a first derivative of the function at the first location and a second derivative of the function at the first location.
Example 171 includes at least one computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least access training data for the training of a machine learning model, iterate over possible locations of roots using second order bifurcation to determine whether respective ones of the possible locations of the roots are roots of the training data, and create the machine learning model based on the respective ones of the possible locations of the roots that are determined to be roots of the training data.
Example 172 includes the at least one computer readable storage medium of example 171, wherein the training data is represented by a function, and to determine whether a first location of the possible locations of the roots is a root, the instructions cause the at least one processor to determine if a value of the function at the first location is less than an error threshold, and in response to determining that the value of the function is less than the error threshold, record an indication that the first location is a first root.
Example 173 includes the at least one computer readable storage medium of example 172, wherein the instructions cause the at least one processor to, in response to the determination that the value of the function is not less than the error threshold, identify a second location, the second location to be used in a subsequent iteration to determine whether the second location is a second root.
Example 174 includes the at least one computer readable storage medium of example 173, wherein the instructions cause the at least one processor to not use the second location in the subsequent iteration if the second location is within a threshold distance of the first root.
Example 175 includes the at least one computer readable storage medium of example 173, wherein the instructions cause the at least one processor to, in response to the determination that the value of the function is not less than the error threshold, identify a third location, the third location to be used in the subsequent iteration to determine whether the third location is a third root.
Example 176 includes the at least one computer readable storage medium of any one of examples 173-175, wherein the instructions cause the at least one processor to identify the second location based on a first derivative of the function at the first location and a second derivative of the function at the first location.
Example 177 includes a method of training a machine learning model, the method including accessing training data for the training of the machine learning model, iterating over possible locations of roots using second order bifurcation to determine whether respective ones of the possible locations of the roots are roots of the training data, and creating the machine learning model based on the respective ones of the possible locations of the roots that are determined to be roots of the training data.
Example 178 includes the method of example 177, wherein the training data is represented by a function, and the determining of whether a first location of the possible locations of the roots is a root includes determining if a value of the function at the first location is less than an error threshold, and in response to determining that the value of the function is less than the error threshold, recording an indication that the first location is a first root.
Example 179 includes the method of example 178, further including, in response to determining that the value of the function is not less than the error threshold, identifying a second location, the second location to be used in a subsequent iteration to determine whether the second location is a second root.
Example 180 includes the method of example 179, further including not using the second location in the subsequent iteration if the second location is within a threshold distance of the first root.
Example 181 includes the method of example 179, further including, in response to determining that the value of the function is not less than the error threshold, identifying a third location, the third location to be used in the subsequent iteration to determine whether the third location is a third root.
Example 182 includes the method of any one of examples 179-181, wherein the identification of the second location is based on a first derivative of the function at the first location and a second derivative of the function at the first location.
Example 183 includes an apparatus for training of a machine learning model, the apparatus including a root finder to iterate over possible locations of roots within training data stored in a training data datastore, the iteration performed using second order bifurcation to determine whether respective ones of the possible locations of the roots are roots of the training data, wherein the root finder is to create the machine learning model based on the respective ones of the possible locations of the roots that are determined to be roots of the training data, and a model data datastore to store the machine learning model created by the root finder.
Example 184 includes the apparatus of example 183, wherein the training data is represented by a function, and wherein to determine whether a first location of the possible locations of the roots is a root, the root finder is to determine if a value of the function at the first location is less than an error threshold, and in response to determining that the value of the function is less than the error threshold, record an indication that the first location is a first root.
Example 185 includes the apparatus of example 184, wherein the root finder is to, in response to the determination that the value of the function is not less than the error threshold, identify a second location, the second location to be used in a subsequent iteration to determine whether the second location is a second root.
Example 186 includes the apparatus of example 185, wherein the root finder is to not use the second location in the subsequent iteration if the second location is within a threshold distance of the first root.
Example 187 includes the apparatus of example 185, wherein the root finder is to, in response to the determination that the value of the function is not less than the error threshold, identify a third location, the third location to be used in the subsequent iteration to determine whether the third location is a third root.
Example 188 includes the apparatus of any one of examples 185-187, wherein the root finder is to identify the second location based on a first derivative of the function at the first location and a second derivative of the function at the first location.
Neural Network (NN) size is a limiting factor when deploying learning algorithms on edge devices that are strapped for power, memory, bandwidth handling and/or computing resources. Increasing sizes of NNs may improve certain aspects of task performance (e.g., image recognition) while hindering other aspects of task performance (e.g., latency). To achieve low latency and high throughput, NN sizes may be constrained using different techniques, such as reducing a number of non-zero weights (e.g., pruning, sparsification), lowering a bit-width of weights and activations, etc. In some examples, uniform symmetric or uniform asymmetric quantization techniques are applied to reduce NN sizes. Typically, uniform symmetric quantization techniques realize the largest benefit when specific hardware devices are used that can accommodate clean bit shift operations (e.g., AVX 512 x86 instruction set architecture (ISA)), but examples disclosed herein are not limited to any specific ISA and/or hardware. On the other hand, non-uniform quantization utilizes a dictionary in which keys include relatively lower bit-width representations of values. Such dictionaries bode well for reducing size constraints, but require dictionary lookup overhead, and such dictionaries themselves can become large. Examples disclosed herein discover and accelerate inference of dictionary-based weighting with non-uniform quantized NNs.
Prior efforts to discover dictionaries rely on sequential search-and-replace decompression techniques and/or reconfigurable hardware. Some prior efforts involve complex initializing of prior mixtures of Gaussians and learning mixture parameters of both those Gaussians and network weights via maximum likelihood techniques. Further complex posterior associations are required with a Dirac distribution up to machine precision to reduce quantization error with post fine-tuning. Still other techniques take pre-trained networks and learn dictionary values by gradient propagation, but such techniques have a substantially limited centroid search space, rather than updating the weights themselves as disclosed herein. Further drawbacks to prior techniques include, but are not limited to, a requirement for highly specific FPGAs and/or ASICs, which carry a relatively high engineering cost. Example weights disclosed herein include one or more values represented by alphanumeric characters. Such values may be stored in one or more data structures, in which example data structures include integers, floating point representations and/or characters. Weights and corresponding values to represent such weights represent data stored in any manner. Such data may also propagate from a first data structure to a second or any number of subsequent data structures along a data path, such as a bus.
In contrast to prior techniques, examples disclosed herein achieve desired accuracy with two epochs rather than a typical 100 (or more) epochs. Examples disclosed herein achieve reduced bit-precision inferences, thereby reducing memory bandwidth bottleneck constraints. Generally speaking, dictionary-based weight sharing is a superset of all quantization techniques, such as power-of-two methods, uniform symmetric methods, and uniform asymmetric methods. Such techniques are computing device agnostic.
FIG. ID24_B1 illustrates an example matrix compression framework ID24_B250. In the illustrated example of FIG. ID24_B1, a first tensor ID24_B252, a second tensor ID24_B254, a third tensor ID24_B256 and a fourth tensor ID24_B258 each include four values. While the illustrated example of FIG. ID24_B1 includes values as INT8, examples disclosed herein are not limited thereto. The example first tensor ID24_B252 and the example second tensor ID24_B254 include four separate unique values, some of which are repeated in different cells. A similar example circumstance occurs with regard to the example third tensor ID24_B256 and the example fourth tensor ID24_B258. As disclosed in further detail below, one or more dictionaries are generated for any number of tensors (e.g., a matrix of weight values). Absent any effort to compress the example tensors, their combined representation consumes 16 bytes of memory.
In view of the example first tensor ID24_B252 and the example second tensor ID24_B254 exhibiting a common set of four unique weight values (e.g., determined via one or more clustering techniques/algorithms), a first dictionary ID24_B260 is generated. In particular, each unique weight value is associated with a particular binary representation, referred to herein as a key. Because there are four (4) unique weight values in the aforementioned example first and second tensors, two bits are able to fully represent each key in the combination of all weight values. Accordingly, representations of each tensor occur by way of substituted key values to generate a respective compressed tensor. In the illustrated example of FIG. ID24_B1, a first compressed tensor ID24_B262, a second compressed tensor ID24_B264, a third compressed tensor ID24_B266 and a fourth compressed tensor ID24_B268 are represented with 2-bit values rather than INT8 values, thereby conserving an amount of memory required for representation/storage. Accordingly, the representation of all four example tensors now consumes 12 bytes of memory (rather than 16 bytes in their uncompressed representation).
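The 16-byte and 12-byte figures above can be reproduced with simple arithmetic. The sketch below is one accounting that is consistent with those figures; it assumes (this is not stated explicitly in the text) that a second dictionary serves the third and fourth tensors and that each dictionary stores its four unique values as INT8.

```python
import math

VALUES_PER_TENSOR = 4   # each example tensor holds four values
NUM_TENSORS = 4
INT8_BITS = 8
UNIQUE_VALUES = 4       # unique weight values per dictionary
NUM_DICTIONARIES = 2    # assumption: one per pair of tensors

# Uncompressed: every value is stored as INT8.
uncompressed_bytes = NUM_TENSORS * VALUES_PER_TENSOR * INT8_BITS // 8

# Compressed: every value is replaced with a key wide enough to
# distinguish the unique values (2 bits for 4 values).
key_bits = math.ceil(math.log2(UNIQUE_VALUES))
key_bytes = NUM_TENSORS * VALUES_PER_TENSOR * key_bits // 8

# The dictionaries themselves must also be stored.
dictionary_bytes = NUM_DICTIONARIES * UNIQUE_VALUES * INT8_BITS // 8

compressed_bytes = key_bytes + dictionary_bytes
print(uncompressed_bytes, compressed_bytes)
```

Under these assumptions the dictionaries account for most of the compressed size in such a small example; the relative saving grows as more values share each dictionary.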
FIG. ID24_B2 illustrates example matrix progressions ID24_B200 to generate a NN dictionary. In the illustrated example of FIG. ID24_B2, the matrix progressions ID24_B200 include a weight matrix ID24_B202 (e.g., a tensor), a cluster index (centroids) ID24_B204, an approximated weight matrix ID24_B206, and a compressed matrix ID24_B208 (e.g., a compressed tensor). In some examples, the weight matrix ID24_B202 is referred to as a tensor or a tensor matrix. In operation, a particular weight matrix ID24_B202 includes a variety of values in which each value consumes a particular amount of memory based on a value type (e.g., floating point values (FP32, FP16), integer values (INT16, INT8), etc.). As disclosed in further detail below, a clustering algorithm is applied to the example weight matrix ID24_B202 to identify a cluster index (centroids) ID24_B204. In some examples, clusters are discretized values based on an average (e.g., (−0.5+−0.7)/2=−0.6). Each centroid is associated with a bit representation ID24_B210, and the example approximated weight matrix ID24_B206 is populated with the nearest centroid values. Finally, the centroid values are replaced with the previously identified bit (e.g., binary) representations ID24_B210 to produce the compressed matrix ID24_B208, which consumes a substantially lower amount of memory for each value (e.g., a 2-bit binary representation per value versus a FP32 value representation).
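The progression from weight matrix to centroids, approximated weight matrix and compressed matrix can be sketched with a small one-dimensional k-means. This is an illustrative sketch only: the function name build_dictionary, the evenly spaced centroid initialization and the fixed iteration count are assumptions, and any clustering algorithm could be substituted.

```python
import numpy as np


def build_dictionary(weights, num_centroids, iterations=20):
    """Cluster weights and return (centroids, approximated, compressed)."""
    flat = weights.ravel().astype(np.float64)
    # Assumption: start with centroids spread evenly over the value range.
    centroids = np.linspace(flat.min(), flat.max(), num_centroids)
    for _ in range(iterations):
        # Assign every weight to its nearest centroid.
        keys = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the average of its assigned weights.
        for k in range(num_centroids):
            members = flat[keys == k]
            if members.size:
                centroids[k] = members.mean()
    keys = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    approximated = centroids[keys].reshape(weights.shape)
    # The key (centroid index) stands in for the bit representation.
    compressed = keys.reshape(weights.shape).astype(np.uint8)
    return centroids, approximated, compressed


weights = np.array([[-0.5, -0.7], [1.2, 1.0]])
centroids, approximated, compressed = build_dictionary(weights, 2)
```

With two centroids, the values −0.5 and −0.7 share the centroid −0.6, matching the averaging example above.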
Traditional approaches to developing dictionaries and/or determining tensor/matrix representations having a memory/storage requirement less than corresponding original tensor/matrix representations involve identifying optimized centroid values. For instance, prior techniques identify centroids as a grouping of unique values, which defines a search space (Rc), where c represents a number of identified centroids. Within this limited search space Rc, traditional techniques adjust centroid representations in connection with a loss function in view of centroid weight values. As a result, updated centroid values are determined by traditional techniques to reduce (e.g., minimize) loss, but the weight values themselves never change. In other words, because Rc is a confined search space defined by the number of centroids, merely adjusting those centroids offers a lost opportunity when compressing tensors/matrices. Examples disclosed herein are not limited to the confined search space Rc, but instead update tensor weights themselves.
FIG. ID24_A1 illustrates an example optimizer circuit ID24_A102 to generate and/or otherwise invoke NN dictionaries. In the illustrated example of FIG. ID24_A1, the optimizer circuit ID24_A102 includes an example matrix retriever ID24_A104, an example cluster engine ID24_A106, an example loss calculator circuit ID24_A108, an example gradient calculator ID24_A110, an example dictionary builder circuit ID24_A112, and an example matrix decompressor circuit ID24_A114. In operation, the example matrix retriever ID24_A104 retrieves, receives and/or otherwise obtains a weight matrix (e.g., a tensor), such as the example weight matrix ID24_B202 of FIG. ID24_B2. The example optimizer circuit ID24_A102 performs dictionary discovery in an iterative manner consistent with an example first tuning methodology ID24_A116 and/or an example second tuning methodology ID24_A118.
In view of the example first tuning methodology ID24_A116, the example cluster engine ID24_A106 performs one or more clustering techniques (e.g., k-means clustering algorithm) to identify, define and/or otherwise learn of unique cluster values of the weight matrix ID24_B202. In view of identified clusters, the cluster engine ID24_A106 calculates corresponding weights to best fit those centroid(s). Such calculations include a degree of error (loss). Accordingly, the example loss calculator circuit ID24_A108 performs a forward pass and calculates loss values corresponding to the calculated weights of the weight matrix. As the model (e.g., a neural network) learns, an error value between the original weights (W) and clustered weights (W′C) becomes smaller (e.g., converges). In particular, during the weight update operation an assumption is made that the forward pass occurs with the original weights rather than the clustered weights in a manner consistent with example Representation ID24_1.
W′ = cluster(W)    (Representation ID24_1)
The example gradient calculator ID24_A110 calculates gradient value(s) (e.g., approximations) for each unique weight in a backward pass in a manner consistent with example Representation ID24_2.
Because the assumed identity associates a derivative of clustered weights with respect to initial weights, in some examples the derivative of the clustered weights need not be calculated. In other examples, the example optimizer circuit ID24_A102 calculates the derivative of the clustered weights with respect to the initial weights, and associates the derivative with an identity function. Rather than merely update centroid values, as in prior techniques, examples disclosed herein iterate to re-learn both the centroids and the weights at every step in a manner consistent with Representation ID24_3.
In the illustrated example of Representation ID24_3, n represents a learning weight (e.g., a scalar). For instance, initial iterations of example Representation ID24_3 cause initial changes (e.g., swings) having a magnitude relatively greater than subsequent iterations in the effort to converge. As such, setting the learning weight (n) to a relatively lower value after a threshold number of iterations facilitates the ability to continue to converge with improved granularity and reduced overshoot. Additionally, example Representation ID24_3 enables examples disclosed herein to avoid constraining and/or otherwise limiting the search space to only those finite centroids and, instead, enable the weights to be modified in the effort to minimize loss and converge. The approximation lies in the stability of the clustering algorithm used, and the gradient backpropagation is performed in a manner consistent with example Representation ID24_4.
In particular, the difference between the clustered weight and the original weight becomes an identity and it is assumed that the gradients are the same. The assumption is that there is no real delta when advancing from the clustered weights to the non-clustered weights. As training continues with a network, this assumption gains validity.
In addition to the example first tuning methodology ID24_A116 of FIG. ID24_A2, examples disclosed herein also facilitate the example second tuning methodology ID24_A118. In the illustrated example of FIG. ID24_A2, a network is defined (Net). The defined network (Net) includes two arguments: (1) an input (x) and (2) parameters (W). In some examples, the input is an image, and the parameters define the weight values. A backup of the original weights is generated as W′, and W is replaced with W′ during a restoration operation. Additionally, in the illustrated example second tuning methodology ID24_A118, the forward pass occurs with the clustered weights W′c by using Net(x; W′c). Thereafter, gradients are calculated and weights are restored after the gradient calculation, which aligns with an identity derivative between clustered and original weights. In view of the order of the example second tuning methodology, improved accuracy is at least one beneficial result, particularly with low bit-width quantizations.
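The ordering of the second tuning methodology (back up, cluster, forward pass with clustered weights, gradient calculation, restore, update) can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the network is a plain linear model with a mean-squared-error loss so that the gradient can be written by hand, the clustering is a small one-dimensional k-means, and the names cluster, net, loss_and_gradient, tune and learning_weight are hypothetical.

```python
import numpy as np


def cluster(weights, num_centroids, iterations=10):
    """Snap each weight to the nearest of num_centroids k-means centroids."""
    centroids = np.linspace(weights.min(), weights.max(), num_centroids)
    for _ in range(iterations):
        keys = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(num_centroids):
            if np.any(keys == k):
                centroids[k] = weights[keys == k].mean()
    keys = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[keys]


def net(x, weights):
    """Assumed stand-in for Net(x; W): a linear model."""
    return x @ weights


def loss_and_gradient(x, y, weights):
    error = net(x, weights) - y
    loss = float(np.mean(error ** 2))
    gradient = 2.0 * x.T @ error / len(y)
    return loss, gradient


def tune(x, y, weights, num_centroids=4, learning_weight=0.1, steps=200):
    W = weights.copy()
    for _ in range(steps):
        backup = W.copy()                  # back up the original weights
        W = cluster(W, num_centroids)      # clustered weights W'c
        _, gradient = loss_and_gradient(x, y, W)  # forward/backward with W'c
        W = backup                         # restore the original weights
        # Identity derivative assumed between clustered and original
        # weights, so the gradient is applied to the restored weights.
        W = W - learning_weight * gradient
    return W


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))
true_weights = np.array([-1.0, -1.0, 0.5, 0.5, 2.0, 2.0, -0.25, -0.25])
y = x @ true_weights
initial = np.zeros(8)
initial_loss, _ = loss_and_gradient(x, y, cluster(initial, 4))
tuned = tune(x, y, initial)
final_loss, _ = loss_and_gradient(x, y, cluster(tuned, 4))
```

Because the centroids are recomputed from the updated weights on every step, both the centroids and the weights are re-learned, rather than only the centroids being adjusted.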
The example dictionary builder circuit ID24_A112 associates each value with a bit representation and replaces weights in the original weight matrix with the closest unique values, thereby generating an augmented weight matrix. The example dictionary builder circuit ID24_A112 also replaces unique values with their corresponding bit representations, and the example optimizer circuit ID24_A102 applies one or more minimal description length assignments. Accordingly, compressed weights and/or weight matrices are of a size relatively smaller than their original configuration, consume less storage space, and require less bandwidth when transmitting models (e.g., to other devices, such as Edge devices having constrained memory, power and/or communication capabilities). In some non-limiting examples, the optimizer circuit ID24_A102 determines whether histogram buckets have exceeded a threshold and, if so, the loss function is modified to be proportional to the variance of a Gaussian fitting the histogram points. For instance, in the event a variable length compression is to be performed to gain further space savings, loss functions are created to increase frequency of some centroids to allow for greater variable length compression. In other examples, the clustering algorithm is changed to increase a number of weights, which facilitates greater variable length compression. Examples disclosed herein optionally expose one or more knobs to facilitate selection and/or adjustment of selectable options. Knobs may be selected by, for example, a user and/or an agent. Agent knob adjustment may occur in an automatic manner independent of the user in an effort to identify one or more particular optimized knob settings. In some examples, knobs are used to control a level of a fixed length compression (dictionary size) that is required, which comes at the expense of a degree of accuracy.
In other examples, one or more weight restrictions are imposed/set to facilitate greater variable length compression to enable better overall model compression. One such example includes an L1 loss to increase model sparsity, which allows for greater compression of the network for communication purposes.
FIG. ID24_C illustrates an example compression process ID24_C300. In the illustrated example of FIG. ID24_C, an INT8 to INT4 compression process is illustrated with an assumption that 16 unique values occur at the compression stage. Two INT8 values are packed into a single INT8 value using INT4 keys. In some examples, the dictionary builder circuit ID24_A112 interleaves the compressed tensor to avoid and/or otherwise reduce overhead at the subsequent decompression. For instance, particular hardware and associated instruction set architectures (ISAs) may benefit from interleaving, such as the example AVX 512 ISA. In some examples, interleaving is not performed (e.g., to avoid any need for additional bit shift operations during decompression).
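Packing two INT4 keys into a single byte can be sketched as shown below. The nibble order (even-indexed key in the lower nibble, odd-indexed key in the upper nibble) and the function name pack_int4_keys are assumptions made for illustration, and no interleaving is applied.

```python
import numpy as np


def pack_int4_keys(keys):
    """Pack pairs of 4-bit dictionary keys into single bytes."""
    keys = np.asarray(keys, dtype=np.uint8)
    if keys.size % 2 != 0:
        raise ValueError("an even number of keys is required")
    if np.any(keys > 0x0F):
        raise ValueError("keys must fit in 4 bits (16 unique values)")
    low = keys[0::2]    # assumed: even positions go to the lower nibble
    high = keys[1::2]   # assumed: odd positions go to the upper nibble
    return (high << 4) | low


packed = pack_int4_keys([0xD, 0x1, 0x0, 0xF])
```

Here four keys occupy two bytes, half of what the four original INT8 values would require.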
FIG. ID24_D illustrates example dictionaries in linear memory ID24_D400 available for inference. In the illustrated example of FIG. ID24_D, the linear memory ID24_D400 includes four dictionaries and four corresponding indexes but examples disclosed herein are not limited thereto. Additionally, the illustrated example of FIG. ID24_D includes dictionaries that are 128 bits in length, but examples disclosed herein are not limited thereto and such lengths may dynamically change in view of the number of unique values discovered. As such, in the event a relatively fewer number of unique values is identified for a particular NN, decompression overhead is reduced with a correspondingly shorter bit length. Particular selection of dictionaries is accomplished with a 2-bit index per tile, which accommodates identification of four separate dictionaries. While the illustrated example of FIG. ID24_D includes four dictionaries with a 2-bit index, examples disclosed herein are not limited thereto.
FIG. ID24_E illustrates example matrix/tensor decompression ID24_E500 during inference time. In other words, each INT4 key is replaced with a corresponding INT8 value. In operation, the example matrix decompressor circuit ID24_A114 receives, retrieves and/or otherwise obtains a compressed matrix/tensor (e.g., INT4) and extracts lower and upper nibbles into two separate registers because the amount of information is being doubled from INT4 to INT8. For the sake of this example, the compressed tensor is 512 bits wide and will be decompressed to 1024 bits wide. The example matrix decompressor circuit ID24_A114 initializes a register with a dictionary key that is to be replaced. For the sake of this example, assume that the key 1101 is to be replaced with a corresponding decompressed value. The example matrix decompressor circuit ID24_A114 generates a mask for the key and the corresponding nibble, which is performed for each of the two registers above. As such, the masks identify where replacement is to occur, which is repeated for each key to generate the decompressed matrix.
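The mask-and-replace decompression can be illustrated in software with array operations standing in for vector registers. This sketch assumes a one-dimensional packed tensor in which the lower nibble of each byte holds the earlier key and the upper nibble the later key; the function name decompress and the sample dictionary are hypothetical.

```python
import numpy as np


def decompress(packed, dictionary):
    """Replace each INT4 key in `packed` with its INT8 dictionary value."""
    packed = np.asarray(packed, dtype=np.uint8)
    dictionary = np.asarray(dictionary, dtype=np.int8)
    # Extract lower and upper nibbles into two separate arrays
    # (the software analogue of two registers).
    lower = packed & 0x0F
    upper = packed >> 4
    lower_out = np.zeros(packed.shape, dtype=np.int8)
    upper_out = np.zeros(packed.shape, dtype=np.int8)
    # One pass per key: build a mask where the key occurs, then write
    # the decompressed value at the masked positions.
    for key, value in enumerate(dictionary):
        lower_out[lower == key] = value
        upper_out[upper == key] = value
    # Recombine into an output that is twice as wide as the input.
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = lower_out
    out[1::2] = upper_out
    return out


dictionary = np.arange(-8, 8)          # hypothetical 16-entry dictionary
values = decompress([0x1D, 0xF0], dictionary)
```

The loop runs once per dictionary entry, so a 4-bit key requires up to 16 mask-and-replace passes per nibble array.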
Worth noting is that any decompression process is typically processor intensive. Because the mask-and-replace operation is repeated for each key, each additional bit of key width doubles the number of keys to process, such that the computational cost increases exponentially with the number of bits. Accordingly, reducing the dictionary bit width as disclosed above has a substantial effect on device efficiency during inference. This is particularly helpful at Edge locations with computing devices having fewer resources, such as IoT devices and/or mobile devices with limited computing power, bandwidth capabilities and/or on-board energy resources (e.g., battery).
While an example manner of implementing the optimizer circuit ID24_A102 of FIG. ID24_A is illustrated in FIG. ID24_A, one or more of the elements, processes and/or devices illustrated in FIG. ID24_A may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way. Further, the example matrix retriever ID24_A104, the example cluster engine ID24_A106, the example loss calculator circuit ID24_A108, the example gradient calculator ID24_A110, the example dictionary builder circuit ID24_A112, the example matrix decompressor circuit ID24_A114 and/or, more generally, the example optimizer circuit ID24_A102 of FIG. ID24_A may be implemented by hardware, software, firmware and/or any combination of hardware, software and/or firmware. Example hardware implementations include implementation on the example compute circuitry D102 (e.g., the example processor D104) of
Flowcharts representative of example hardware logic, machine-readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the optimizer circuit ID24_A102 of FIG. ID24_A are shown in FIGS. ID24_F through ID24_I. The machine-readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor and/or processor circuitry, such as the processor D152 shown in the example processor platform D150 discussed above in connection with
As mentioned above, the example processes of FIGS. ID24_F through ID24_I may be implemented using executable instructions (e.g., computer and/or machine-readable instructions) stored on a non-transitory computer and/or machine-readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
The program ID24_F100 of FIG. ID24_F includes block ID24_F102, where the example matrix retriever ID24_A104 retrieves, receives and/or otherwise obtains a weight matrix, such as the example weight matrix ID24_B202 of FIG. ID24_B2. The example cluster engine ID24_A106 performs a clustering technique to identify and/or otherwise learn of unique cluster values of the weight matrix ID24_B202 (block ID24_F104), and the example loss calculator circuit ID24_A108 performs a forward pass with clustered weights and calculates loss values corresponding to the weight matrix (block ID24_F106). The example gradient calculator ID24_A110 calculates gradients for each unique weight in a backward pass (block ID24_F108), and the example optimizer circuit ID24_A102 calculates the derivative of the clustered weights with respect to the initial weights (block ID24_F110), associates the derivative with an identity function (block ID24_F112), and iterates to re-learn both the centroids and the weights at every step (block ID24_F114). However, in some examples blocks ID24_F110 and ID24_F112 need not be performed and, instead, weight restoration may occur after re-learning centroids (block ID24_F114).
If convergence has not yet occurred (block ID24_F116), control returns to block ID24_F104; otherwise, the example dictionary builder circuit ID24_A112 packages the built dictionaries for runtime (block ID24_F118). Turning to FIG. ID24_F′, the example dictionary builder circuit ID24_A112 (see FIG. ID24_A) associates each value with a particular bit representation (block ID24_F202) and replaces weights from the original weight matrix with the closest unique values to generate an augmented weight matrix (block ID24_F204). The example dictionary builder circuit ID24_A112 replaces unique values with a bit representation as dictionary keys, and the example optimizer circuit ID24_A102 applies one or more types of minimal description length assignment (block ID24_F208).
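The loop of FIG. ID24_F can be illustrated with a small Python sketch. It is a toy under stated assumptions rather than the disclosed implementation: the model is a single linear layer with a mean-squared-error loss, the clustering is a plain one-dimensional k-means, the learning rate `lr` is a fixed hypothetical value, and all function names are hypothetical. Treating the derivative of the clustered weights with respect to the original weights as an identity function means the gradients computed for the clustered weights are applied directly to the original weights (an approach often called a straight-through estimator).

```python
def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means returning k centroids (assumes k >= 2)."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly over the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[idx].append(v)
        # Empty clusters keep their previous centroid.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def nearest(value, centroids):
    """Return the centroid closest to `value`."""
    return min(centroids, key=lambda c: abs(c - value))


def forward_loss_and_grads(weights, inputs, targets):
    """Forward pass of a toy linear model; returns MSE loss and its gradients."""
    n = len(inputs)
    grads = [0.0] * len(weights)
    loss = 0.0
    for x, t in zip(inputs, targets):
        err = sum(w * xi for w, xi in zip(weights, x)) - t
        loss += err * err / n
        for j, xj in enumerate(x):
            grads[j] += 2.0 * err * xj / n
    return loss, grads


def train_clustered(weights, inputs, targets, k=4, lr=0.05, steps=50):
    """Re-learn centroids and weights at every step."""
    weights = list(weights)
    losses = []
    for _ in range(steps):
        centroids = kmeans_1d(weights, k)
        clustered = [nearest(w, centroids) for w in weights]
        loss, grads = forward_loss_and_grads(clustered, inputs, targets)
        losses.append(loss)
        # Identity derivative: gradients of the clustered weights are applied
        # to the original weights, scaled by the learning rate.
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, kmeans_1d(weights, k), losses


# Hypothetical toy data: four one-hot inputs, so each weight has one target.
inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
targets = [1.0, 1.0, -1.0, -1.0]
weights, centroids, losses = train_clustered(
    [0.2, 0.1, -0.1, -0.2], inputs, targets, k=2, lr=0.5, steps=30
)
```

A fixed step count stands in for the convergence check here; a threshold on the change in loss between iterations could be substituted.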
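The dictionary-building steps above can be sketched as follows. This Python example is a minimal illustration with hypothetical names; it assigns fixed-width keys in sorted order instead of a minimal description length assignment, and it assumes keys fit in 4 bits and are packed two per byte with the lower nibble first.

```python
def build_dictionary(unique_values):
    """Associate each unique value with an integer key and report the key width."""
    values = sorted(unique_values)
    # Smallest fixed width (in bits) able to address every value.
    key_bits = max(1, (len(values) - 1).bit_length())
    return dict(enumerate(values)), key_bits


def encode(weights, dictionary):
    """Replace each weight with the key of its closest dictionary value."""
    return [min(dictionary, key=lambda k: abs(dictionary[k] - w)) for w in weights]


def pack_nibbles(keys):
    """Pack two 4-bit keys per byte, lower nibble first (assumed order)."""
    if len(keys) % 2:
        keys = keys + [0]  # pad odd-length input with key 0
    return bytes(keys[i] | (keys[i + 1] << 4) for i in range(0, len(keys), 2))


# Hypothetical values standing in for learned centroids and original weights.
dictionary, key_bits = build_dictionary([0.5, -0.5, 0.1])
original = [0.11, -0.48, 0.52, 0.09, -0.51, 0.5]
keys = encode(original, dictionary)
augmented = [dictionary[k] for k in keys]  # weights snapped to unique values
packed = pack_nibbles(keys)
```

Here `augmented` corresponds to the augmented weight matrix, and `packed` is the key-only form that would be stored alongside the dictionary for runtime.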
Turning to FIG. ID24_G, the example optimizer circuit ID24_A102 determines whether histogram buckets exceed a threshold value (block ID24_G302). In some examples, if the threshold(s) has/have been exceeded (block ID24_G302), then the example optimizer circuit ID24_A102 modifies the loss function to be proportional to the variance of the Gaussian fitting the histogram points (block ID24_G304). In other examples, the optimizer circuit ID24_A102 optionally changes a clustering algorithm during a next iteration to increase a number of weights (block ID24_G306).
FIG. ID24_I illustrates example matrix/tensor decompression during inference time. In the illustrated example of FIG. ID24_I, the matrix decompressor circuit ID24_A114 extracts the lower and upper nibbles into two separate registers (block ID24_I402) because the amount of information is being doubled from INT4 to INT8. The example matrix decompressor circuit ID24_A114 initializes a register with a dictionary key that is to be replaced (block ID24_I404). The example matrix decompressor circuit ID24_A114 generates a mask for the key and the corresponding nibble, which is performed for each of the two registers above (block ID24_I406). The masks identify where replacement is to occur, and the masking and replacement are repeated for each key to generate the decompressed matrix (block ID24_I408).
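The histogram check of FIG. ID24_G admits more than one reading. The Python sketch below takes one of them as an assumption: it counts the populated histogram buckets of the weights and, when that count exceeds a threshold, adds a term proportional to the variance of a Gaussian fitted to the weights (taken here as the population variance). The bucket count, threshold, scale factor and function names are all hypothetical.

```python
from statistics import pvariance


def populated_buckets(weights, num_buckets=16):
    """Count histogram buckets that contain at least one weight."""
    lo, hi = min(weights), max(weights)
    if lo == hi:
        return 1
    width = (hi - lo) / num_buckets
    # The maximum value is clamped into the last bucket.
    return len({min(int((w - lo) / width), num_buckets - 1) for w in weights})


def regularized_loss(task_loss, weights, bucket_threshold=4, scale=1.0):
    """Add a variance-proportional term when too many buckets are populated."""
    if populated_buckets(weights) > bucket_threshold:
        # Variance of the maximum-likelihood Gaussian fit to the weights.
        return task_loss + scale * pvariance(weights)
    return task_loss


tight = [0.5, 0.5, -0.5, -0.5]            # two populated buckets
spread = [i / 10 for i in range(-5, 6)]   # weights scattered across the range
loss_tight = regularized_loss(1.0, tight)
loss_spread = regularized_loss(1.0, spread)
```

In this reading, the added term pushes scattered weights toward a narrower distribution, which leaves fewer unique values to represent in the dictionary.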
Example methods, apparatus, systems, and articles of manufacture to optimize resources in Edge networks are disclosed herein. Further examples and combinations thereof include the following:
Example 189 includes an apparatus comprising at least one of a central processing unit, a graphic processing unit or a digital signal processor, the at least one of the central processing unit, the graphic processing unit or the digital signal processor having control circuitry to control data movement within the processor circuitry, arithmetic and logic circuitry to perform one or more first operations corresponding to instructions, and one or more registers to store a result of the one or more first operations, the instructions in the apparatus, a Field Programmable Gate Array (FPGA), the FPGA including logic gate circuitry, a plurality of configurable interconnections, and storage circuitry, the logic gate circuitry and interconnections to perform one or more second operations, the storage circuitry to store a result of the one or more second operations, or Application Specific Integrated Circuitry including logic gate circuitry to perform one or more third operations, the processor circuitry to at least one of perform at least one of the first operations, the second operations or the third operations to calculate clusters corresponding to original weight values of a weight matrix, calculate first clustered weight values, initiate a forward pass to calculate loss values based on the clustered weight values, calculate gradients corresponding to the clustered weight values, and calculate second clustered weight values based on a difference between the original weight values and the gradients.
Example 190 includes the apparatus as defined in example 189, wherein the processor circuitry is to modify the gradients with a learning weight.
Example 191 includes the apparatus as defined in example 190, wherein the processor circuitry is to modify the learning weight based on a threshold number of iterations.
Example 192 includes the apparatus as defined in example 189, wherein the processor circuitry is to obtain the weight matrix from a model, the model to be compressed prior to execution on a network device.
Example 193 includes the apparatus as defined in example 189, wherein the processor circuitry is to associate the second clustered weight values with key values, and replace the original weight values of the weight matrix with the key values corresponding to the second clustered weight values.
Example 194 includes the apparatus as defined in example 189, wherein the processor circuitry is to iteratively calculate the second clustered weight values until a threshold convergence value is satisfied.
Example 195 includes the apparatus as defined in example 189, wherein the processor circuitry is to invoke a k-means clustering algorithm to calculate the clusters.
Example 196 includes at least one non-transitory computer readable storage medium comprising instructions that, when executed, cause at least one processor to at least calculate clusters corresponding to original weight values of a weight matrix, calculate first clustered weight values, initiate a forward pass to calculate loss values based on the clustered weight values, calculate gradients corresponding to the clustered weight values, and calculate second clustered weight values based on a difference between the original weight values and the gradients.
Example 197 includes the computer readable storage medium as defined in example 196, wherein the instructions, when executed, cause the at least one processor to modify the gradients with a learning weight.
Example 198 includes the computer readable storage medium as defined in example 197, wherein the instructions, when executed, cause the at least one processor to modify the learning weight based on a threshold number of iterations.
Example 199 includes the computer readable storage medium as defined in example 196, wherein the instructions, when executed, cause the at least one processor to obtain the weight matrix from a model, the model to be compressed prior to execution on a network device.
Example 200 includes the computer readable storage medium as defined in example 197, wherein the instructions, when executed, cause the at least one processor to associate the second clustered weight values with key values, and replace the original weight values of the weight matrix with the key values corresponding to the second clustered weight values.
Example 201 includes the computer readable storage medium as defined in example 197, wherein the instructions, when executed, cause the at least one processor to iteratively calculate the second clustered weight values until a threshold convergence value is satisfied.
Example 202 includes the computer readable storage medium as defined in example 197, wherein the instructions, when executed, cause the at least one processor to invoke a k-means clustering algorithm to calculate the clusters.
Example 203 includes a method comprising calculating clusters corresponding to original weight values of a weight matrix, calculating first clustered weight values, initiating a forward pass to calculate loss values based on the clustered weight values, calculating gradients corresponding to the clustered weight values, and calculating second clustered weight values based on a difference between the original weight values and the gradients.
Example 204 includes the method as defined in example 203, further including modifying the gradients with a learning weight.
Example 205 includes the method as defined in example 204, further including modifying the learning weight based on a threshold number of iterations.
Example 206 includes the method as defined in example 203, further including obtaining the weight matrix from a model, the model to be compressed prior to execution on a network device.
Example 207 includes the method as defined in example 203, further including associating the second clustered weight values with key values, and replacing the original weight values of the weight matrix with the key values corresponding to the second clustered weight values.
Example 208 includes the method as defined in example 203, further including iteratively calculating the second clustered weight values until a threshold convergence value is satisfied.
Example 209 includes the method as defined in example 203, further including invoking a k-means clustering algorithm to calculate the clusters.
Example 210 includes an apparatus to generate dictionary weights, the apparatus comprising a cluster engine to calculate clusters corresponding to original weight values of a weight matrix, and calculate first clustered weight values, a loss calculator to initiate a forward pass to calculate loss values based on the clustered weight values, a gradient calculator to calculate gradients corresponding to the clustered weight values, and an optimizer to calculate second clustered weight values based on a difference between the original weight values and the gradients.
Example 211 includes the apparatus as defined in example 210, wherein the optimizer is to modify the gradients with a learning weight.
Example 212 includes the apparatus as defined in example 211, wherein the optimizer is to modify the learning weight based on a threshold number of iterations.
Example 213 includes the apparatus as defined in example 210, further including a matrix retriever to obtain the weight matrix from a model, the model to be compressed prior to execution on a network device.
Example 214 includes the apparatus as defined in example 210, wherein the optimizer is to associate the second clustered weight values with key values, and replace the original weight values of the weight matrix with the key values corresponding to the second clustered weight values.
Example 215 includes the apparatus as defined in example 210, wherein the optimizer is to iteratively calculate the second clustered weight values until a threshold convergence value is satisfied.
Example 216 includes the apparatus as defined in example 210, wherein the cluster engine is to invoke a k-means clustering algorithm to calculate the clusters.
Although certain example systems, methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all systems, methods, apparatus, and articles of manufacture fairly falling within the scope of the claims of this patent.
The following claims are hereby incorporated into this Detailed Description by this reference, with each claim standing on its own as a separate embodiment of the present disclosure.
This patent arises from the national stage of International Application No. PCT/US2021/039222, which was filed on Jun. 25, 2021. International Application No. PCT/US2021/039222 claims the benefit of U.S. Provisional Patent Application Ser. No. 63/130,508, which was filed on Dec. 24, 2020. Priority to International Application No. PCT/US2021/039222 and U.S. Patent Application Ser. No. 63/130,508 is hereby claimed.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2021/039222 | 6/25/2021 | WO |

Number | Date | Country
---|---|---
63130508 | Dec 2020 | US