The present disclosure relates generally to machine learning and artificial intelligence (AI), and particularly to automatic computational resource prediction, resource-aware model tuning, and subsequent workload provisioning of AI tasks.
Modern applications/tasks may be deployed in a shared cluster of computing resources at scale. Such applications/tasks may rely on various machine learning and AI models that require careful configuration, training, and tuning. The performance of these machine learning and AI models usually correlates with their complexity, which may dictate the amount of computing resources they consume. Efficiency and accuracy in the dynamic allocation and provisioning of computing resources shared by a large number of machine learning and AI models are thus critical for achieving acceptable overall performance metrics under resource availability constraints.
To assist in understanding the various implementations described in this disclosure, reference is made to the following accompanying drawings:
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any manner.
The disclosure will now be described in detail hereinafter with reference to the accompanying drawings, which form a part of the present disclosure, and which show, by way of illustration, specific examples of embodiments. The disclosure may, however, be embodied in a variety of different forms and, therefore, the covered or claimed subject matter is intended to be construed as not being limited to any of the embodiments set forth below. Further, the disclosure may be embodied as methods, devices, components, or systems. Accordingly, embodiments of the disclosure may, for example, take the form of hardware, software, firmware, or any combination thereof.
In general, terminology may be understood at least in part from usage in context. For example, terms, such as “and”, “or”, or “and/or,” as used herein may include a variety of meanings that may depend at least in part upon the context in which such terms are used. Typically, “or” if used to associate a list, such as A, B or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B or C, here used in the exclusive sense. In addition, the term “one or more” or “at least one” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a”, “an”, or “the”, again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” or “determined by” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for the existence of additional factors not necessarily expressly described, again, depending at least in part on context.
Modern machine learning or artificial intelligence (AI) applications/tasks may require careful tuning and resource prediction depending on the hardware architecture used to run these applications/tasks as well as business requirements (e.g., computational time and/or cost). In general, high-performance computing (HPC) clusters such as computing resources distributed in a cloud typically consist of a diverse set of hardware (CPU, GPU, FPGA, etc.) and memory architectures shared by a large number of user computational jobs or tasks. The terms “user job” and “user task” are used interchangeably in this disclosure. Under these circumstances, minimizing the cost of utilizing a large computing cluster while maintaining optimal performance for the submitted user applications/tasks may be affected by both predictable events (such as the priority of the submitted applications) and unpredictable events (such as early termination of running jobs, which frees up previously occupied computing resources). Traditionally, the selection of hardware architecture and resource adjustment in the computing cluster for user jobs/tasks mainly rely on manual rules and estimates provided by human operators and experts, which leads to inefficiencies, performance degradations, and wasted computing resources.
In this regard, more advanced techniques and algorithms may be constructed to build efficient and intelligent systems that are capable of comprehending and responding at multiple levels to both predictable and uncertain events to dynamically select/suggest hardware architecture, to predict computing resource requirement for user applications/tasks, to tune the user applications (e.g., tuning the machine learning and AI models used in the user applications), and to assign user tasks to computing resources based on both performance target metrics and resource availability, cost, and other constraints.
The present disclosure describes a system, a method, and a product for computational resource prediction of user applications/tasks and subsequent workload provisioning. The computational resource predictions for a user job/task may be achieved using a resource prediction twin machine learning and AI system based on, for example, generative modeling and probabilistic programming. Such a resource prediction twin may provide metrics of interest and tradeoffs between hardware, runtime, and cost metrics to the user for their submitted jobs/tasks. As an example generative model, the resource prediction twin may be further configured to self-improve or evolve over time via probabilistic programming. The workload scheduling and assignment of the user tasks in a computing cluster with components having diverse hardware architectures may be provisioned by an automatic and intelligent assignment/provisioning engine, which may be based on various machine learning and AI models. The automatic workload scheduling and assignment engine may be configured to handle unpredicted uncertainty and adapt to constantly evolving system queues of the tasks submitted by the users to generate queuing/re-queuing, running/termination, and resource allocation/reallocation actions for user tasks. The automatic workload scheduling and assignment engine may be configured to self-improve and evolve via deep reinforcement learning. The methods and systems disclosed herein improve computational resource allocation and provisioning efficiency using AI and other optimization techniques.
While the example embodiments below are described in the context that the user applications/tasks themselves are based on machine learning and AI models that may be trained and tuned according to performance targets and computational resource constraints, the principles underlying the resource prediction twin and the workload scheduling, assignment and provisioning engine are applicable to other user tasks and applications that do not involve machine learning and AI models.
By way of reference, related U.S. application Ser. No. 16/749,717 entitled “Resource-Aware Automatic Machine Learning System” filed on Jan. 22, 2020, and U.S. application Ser. No. 17/182,538 entitled “Resource Prediction System for Executing Machine learning Models” filed on Feb. 23, 2021, both belonging to the current applicant, are incorporated herein by reference in their entireties.
The servers 102 and 104 may be implemented as a central server or a plurality of servers distributed in the communication networks. While the servers 102 and 104 shown in
The user devices 112, 114, and 116 may be any form of mobile or fixed electronic devices, including but not limited to a desktop personal computer, laptop computers, tablets, mobile phones, personal digital assistants, and the like. The user devices 112, 114, and 116 may be installed with a user interface for submitting user applications/tasks and for accessing the system for computational resource prediction, resource-aware model tuning, and subsequent workload provisioning. The one or more databases 118 of
The communication interfaces 202 may include wireless transmitters and receivers (“transceivers”) 212 and any antennas 214 used by the transmitting and receiving circuitry of the transceivers 212. The transceivers 212 and antennas 214 may support Wi-Fi network communications, for instance, under any version of IEEE 802.11, e.g., 802.11n or 802.11ac. The communication interfaces 202 may also include wireline transceivers 216. The wireline transceivers 216 may provide physical layer interfaces for any of a wide range of communication protocols, such as any type of Ethernet, data over cable service interface specification (DOCSIS), digital subscriber line (DSL), Synchronous Optical Network (SONET), or other protocol.
The storage 209 may be used to store various initial, intermediate, or final data or models for implementing the system for computational resource prediction, resource-aware model tuning, and subsequent workload provisioning. These data may alternatively be stored in the database 118 of
The system circuitry 204 may include hardware, software, firmware, or other circuitry in any combination. The system circuitry 204 may be implemented, for example, with one or more systems on a chip (SoC), application-specific integrated circuits (ASIC), microprocessors, discrete analog, digital circuits, and other circuitry.
For example, at least some of the system circuitry 204 may be implemented as processing circuitry 220 for the server 102, including the system for computational resource prediction, resource-aware model tuning, and subsequent workload provisioning of
Alternatively, or in addition, at least some of the system circuitry 204 may be implemented as client circuitry 240 for the user devices 112, 114, and 116 of
In the context of managing user tasks and applications in a shared HPC cluster,
In contrast,
The user task parameters and metrics submitted by the user may then be processed/analyzed by an automatic computing resource prediction twin 420. The automatic resource prediction twin 420 (alternatively referred to as the resource prediction twin) may include one or more trained machine learning and AI models. For example, the resource prediction twin 420 may be based on a generative model that can self-improve or evolve using probabilistic programming. While it acts as a twin of the task submitted by the user, it may generate resource predictions by simulating the user task with respect to various types of hardware platforms and components without actually running or fully running the user AI models on the different hardware platforms. For example, the resource prediction twin 420 may be implemented by creating a hardware platform to run the resource prediction twin to compare the attributes and performance of various machine learning models with respect to different hardware platforms (e.g., compare the power usage, cost, latency, and accuracy of machine learning models on FPGA, GPU, CPU, etc.). Examples of the composition, training, and operation of the resource prediction twin 420 are described in U.S. application Ser. No. 17/182,538 entitled “Resource Prediction System for Executing Machine learning Models” filed on Feb. 23, 2021, which belongs to the same applicant of the instant application and is incorporated herein by reference in its entirety. Further description of the resource prediction twin 420 is provided below in relation to
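As a non-limiting illustration of this simulation-based approach, the following Python sketch estimates runtime and cost for a described task on several candidate platforms without executing the task. The class names, platform profiles, and scaling factors are hypothetical assumptions for illustration only; an actual resource prediction twin would learn such relationships from data rather than hard-coding them.

```python
from dataclasses import dataclass

# Hypothetical per-platform scaling factors; a real prediction twin would
# learn these from training data rather than hard-coding them.
PLATFORM_PROFILES = {
    "CPU":  {"speedup": 1.0, "cost_per_hour": 0.5},
    "GPU":  {"speedup": 8.0, "cost_per_hour": 3.0},
    "FPGA": {"speedup": 5.0, "cost_per_hour": 2.0},
}

@dataclass
class TaskDescription:
    """User task parameters submitted for prediction (illustrative fields)."""
    flops: float        # estimated floating-point operations of the task
    memory_gb: float    # estimated peak memory footprint

def predict_metrics(task: TaskDescription) -> dict:
    """Simulate runtime and cost per platform instead of running the task."""
    baseline_hours = task.flops / (1e12 * 3600)  # assumed CPU throughput of 1 TFLOP/s
    predictions = {}
    for platform, profile in PLATFORM_PROFILES.items():
        runtime = baseline_hours / profile["speedup"]
        predictions[platform] = {
            "runtime_hours": runtime,
            "cost": runtime * profile["cost_per_hour"],
            "memory_gb": task.memory_gb,
        }
    return predictions

print(predict_metrics(TaskDescription(flops=1e18, memory_gb=16)))
```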
In some implementations, the resource prediction twin 420 may perform resource prediction based on a user AI model architecture associated with a set of model hyper-parameters that have already been optimized elsewhere using, for example, the resource-aware machine learning system described in U.S. application Ser. No. 16/749,717 entitled “Resource-Aware Automatic Machine Learning System” filed on Jan. 22, 2020, herein incorporated by reference in its entirety, or other AI models. The resource prediction twin 420 may be mainly concerned with generating resource predictions with respect to various hardware platform and component recommendations based on the optimized architecture and hyper parameters of the user machine learning model. For example, a user AI model with the same model architecture (e.g., the same hyper parameters) may perform differently on different computing platforms or components, and thus the resource prediction based on target metrics for the user AI model may be platform-dependent.
In some implementations, in addition to the already optimized architecture and hyper parameters, the user machine learning model/task may have already been completely trained elsewhere and is input into the resource prediction twin 420 for resource prediction. The input user AI model to the resource prediction twin 420 may thus include both a set of hyper parameters and a set of trained model parameters.
In some other implementations, the resource prediction twin 420, in addition to and together with resource prediction across various hardware platforms under the user-specified target metrics or constraints (hardware usage, cost, time constraints, accuracy, and the like), may be configured to optimize the user AI model architecture represented by, for example, its hyper parameters, based on some initial hyper parameters.
Correspondingly, the output of the resource prediction twin 420 may be configured in various forms. For example, the output of the resource prediction twin 420 may include resource prediction for each different computing platform according to the user task parameters and target metrics. The output may further include an indication of the optimal computing platform under user target metrics and constraints. For another example, the output of the resource prediction twin 420 may include optimized resource allocation for the input user task considering tradeoffs between various competing user target metrics or goals (e.g., tradeoff between reducing computing cost and minimizing run time). For another example, the output of the resource prediction twin 420 may include quantification of competing factors such as computing cost and performance of the user task in different computing platforms. For yet another example, the output of the resource prediction twin 420 may include optimized hyper parameters for the user task in addition to resource prediction.
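One possible, purely illustrative shape for such an output (all field names below are assumptions rather than requirements of the disclosure) is a structured record carrying per-platform predictions, a recommended platform, tradeoff quantifications, and optionally tuned hyper parameters:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PlatformPrediction:
    """Predicted metrics for the user task on one computing platform."""
    platform: str
    runtime_hours: float
    cost: float
    accuracy: float

@dataclass
class ResourcePredictionOutput:
    """Illustrative container for the resource prediction twin's output."""
    per_platform: List[PlatformPrediction]
    recommended_platform: Optional[str] = None          # optimum under user constraints
    tradeoffs: Dict[str, float] = field(default_factory=dict)  # e.g., cost-vs-runtime quantification
    optimized_hyper_parameters: Optional[dict] = None   # populated only if tuning was requested

# Example instantiation:
out = ResourcePredictionOutput(
    per_platform=[PlatformPrediction("GPU", 2.5, 7.5, 0.93),
                  PlatformPrediction("CPU", 20.0, 10.0, 0.93)],
    recommended_platform="GPU")
```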
As further shown in
As shown in
Continuing with
A telemetry Application Programming Interface (API) 470 associated with the HPC cluster 450 may be further provided for monitoring the status of the HPC cluster 450 with respect to, for example, hardware operating and usage status of the HPC cluster 450, hardware/software license consumption status, network and storage status, and status of the user tasks and jobs, as shown in
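As a hedged sketch of how a scheduler or engineer might consume such a telemetry API, the snippet below polls a hypothetical HTTP endpoint for a cluster status snapshot; the URL and field names are invented for illustration and do not describe any particular cluster's actual interface.

```python
import json
import urllib.request

TELEMETRY_URL = "http://hpc-cluster.example/api/telemetry"  # hypothetical endpoint

def poll_cluster_status(url: str = TELEMETRY_URL) -> dict:
    """Fetch a snapshot of cluster status (hardware usage, licenses, jobs)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)

# Illustrative use by a scheduler (field names assumed):
# status = poll_cluster_status()
# free_gpu_nodes = [n for n in status["nodes"] if n["gpu_utilization"] < 0.2]
```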
Additionally or alternatively, the resource prediction twin 420 may output the resource prediction further based on logs, scripts, and collected data 530, hardware data 540, and/or user/group/organization data 570. In the case of basing the resource prediction on the logs, scripts, and collected data 530, the resource prediction twin 420 may be configured to predict various computing resource metrics. For example, the resource prediction twin 420 may predict how long each job or task would take to complete if the user 510 desires to run the job on a GPU versus a CPU with a certain configuration. In the case of basing the resource prediction on the hardware data 540, the resource prediction twin 420 may further predict or adjust user machine learning or AI model parameters. For example, the resource prediction twin 420 may predict a user machine learning or AI model configuration that may satisfy target metrics under given resource constraints (e.g., memory and/or CPU limitations, etc.). In the case of basing the resource prediction on the user/group/organization data 570, the resource prediction twin 420 may predict computing resources considering user, group, and organization relationships and priorities. The data 530, 540, and 570 may be considered separately or holistically.
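A minimal, illustrative sketch of this kind of runtime prediction follows; it fits a simple least-squares model to hypothetical historical job logs and then predicts runtime for a GPU versus a CPU configuration. The feature set and numbers are invented, and a production resource prediction twin would use a far richer learned model.

```python
import numpy as np

# Hypothetical historical log data: [model_size_million_params, batch_size, is_gpu]
X = np.array([[10, 32, 0], [10, 32, 1], [50, 64, 0],
              [50, 64, 1], [100, 64, 1]], dtype=float)
y = np.array([120.0, 18.0, 540.0, 75.0, 160.0])  # observed runtimes in minutes

# Fit a least-squares model runtime ~ X @ w + b as a stand-in for the twin's
# learned mapping from job parameters and hardware choice to runtime.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_runtime(model_size: float, batch: float, on_gpu: bool) -> float:
    """Predict runtime (minutes) for a job on CPU (on_gpu=False) or GPU."""
    return float(np.array([model_size, batch, float(on_gpu), 1.0]) @ w)

print("GPU:", predict_runtime(50, 64, on_gpu=True),
      "vs CPU:", predict_runtime(50, 64, on_gpu=False))
```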
The resource prediction twin 420 may benefit the user 510 by generating predictions for how to satisfy several competing goals without running an actual machine learning model, or benchmarking different models against different sets of hardware platforms. In some implementations of the present disclosure and as shown by 580 of
In some other implementations, hardware-platform-independent user machine learning and AI models may be optimized using a separately trained resource-aware model optimization algorithm (such as the MOBOGA). Model parameters (including hyper parameters and/or trained parameters) of such separately optimized user machine learning and AI models and other user task information may be used as input to the resource prediction twin 420. Such task-specific parameters may be collected and used as part of the data for training the resource prediction twin 420, as shown by 590 in
In some implementations, the resource prediction twin 420 may utilize at its core a generative model 550 that is trained and then evolves based on inference and probabilistic programming. It may be trained based on parameters of the user machine learning model and measured (observed) metrics. The probabilistic programming system as used for the resource prediction twin 420 is explained further below with reference to
In order to output the resource prediction, the resource prediction twin 420 may be trained on a wide variety of data associated with different resources, as shown by 530, 570, and 540 of
The resource prediction twin 420 may output the resource prediction by providing the trade-offs between optimized cases and their targeted objectives (e.g., inference time, memory, latency, model accuracy, and the like). These optimal cases may be represented by, for example, Pareto-Optimal options 595, as explained in U.S. application Ser. No. 16/749,717 and U.S. application Ser. No. 17/182,538, herein incorporated by reference. For example, in outputting the resource prediction as shown in 560, the resource prediction twin 420 may provide trade-offs between estimated cost and runtime for different numbers of GPUs and amounts of memory.
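The following short sketch illustrates the idea of extracting Pareto-optimal options from a set of candidate configurations; the candidate (cost, runtime) pairs are hypothetical, and the systems referenced above may use more sophisticated multi-objective optimization.

```python
from typing import List, Tuple

def pareto_optimal(options: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the options not dominated in (cost, runtime); lower is better in both."""
    front = []
    for cost, runtime in options:
        dominated = any(c <= cost and r <= runtime and (c, r) != (cost, runtime)
                        for c, r in options)
        if not dominated:
            front.append((cost, runtime))
    return sorted(front)

# Hypothetical (estimated cost, estimated runtime) pairs for different GPU/memory counts.
candidates = [(10.0, 8.0), (18.0, 4.5), (30.0, 2.5), (25.0, 5.0), (40.0, 2.4)]
print(pareto_optimal(candidates))  # (25.0, 5.0) is dominated by (18.0, 4.5) and dropped
```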
In some implementations, the resource prediction twin 420 may utilize a generative model trained using probabilistic programming. Using probabilistic programming may facilitate building generative models as programs to solve an inference problem from observed incomplete data. By expressing the data generation process as a program, the probabilistic programming approach may solve the inference problem. The uncertainties in these estimated and inferred parameters may also be quantified. Therefore, probabilistic programming may help capture the real-world process of data generation, whereas traditional or conventional machine learning models perform feature engineering and transformation to make data fit into a model.
In some implementations, a probabilistic model (e.g., in the form of a relationship between metrics y and job/task parameters x: y = θ1x1 + θ2x2 + . . . , where θ represents the model parameters of the probabilistic model) may be constructed to predict metrics y, such as memory usage, power consumption, and model accuracy, from job/task parameters x. Some measured or observed metrics data may have been collected. The goal of the probabilistic programming is to determine and improve the model parameters θ for the generative model. One example way is to first guess the distribution of the θ parameters (guessed parameters) based on an inference of θ and the observed metrics, and to simulate a distribution of the metrics (memory usage, power consumption, model accuracy). Then, when new data comes in, the model is used to make an inference.
The generative model 550 starts with job/task parameters 650 as its guessed model parameters θ, based on the observed metrics 610 (which may include, for example, collected memory usage, power consumption, and model accuracy). The model is able to obtain simulated metrics 660 by generating virtual job(s) 670. The simulated metrics 660 may include, for example, simulated memory usage, power consumption, and model accuracy for the user task. The observed metrics 610 and the simulated metrics 660 generated by the generative model 550 may be received at an inference model 620. The inference model 620 may process the observed metrics 610 and the simulated metrics 660 and generate probabilistic models that can explain the input metrics. The inference model 620 may be used to infer a set of guessed model parameters 630 and their distributions for the generative model 550 by generating second virtual jobs 640 to infer a new set of model parameters θ for the generative model 550.
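A highly simplified sketch of this guess-simulate-infer loop is shown below using a rejection-style approximation: the generative model simulates metrics for virtual jobs from guessed parameters θ, and guesses whose simulated metrics stay close to the observed metrics are retained as the inferred distribution of θ. The one-parameter model, data values, and acceptance threshold are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed metrics (e.g., memory usage in GB) for jobs with known parameters x.
x_observed = np.array([1.0, 2.0, 3.0, 4.0])
y_observed = np.array([2.1, 3.9, 6.2, 7.8])   # roughly y = 2 * x plus noise

def simulate_virtual_jobs(theta: float, x: np.ndarray) -> np.ndarray:
    """Generative model: produce simulated metrics for 'virtual jobs'."""
    return theta * x + rng.normal(0.0, 0.2, size=x.shape)

# Start from a broad guess over theta, then keep the guesses whose simulated
# metrics stay close to the observations (a crude rejection-style inference).
theta_guesses = rng.uniform(0.0, 5.0, size=20000)
accepted = [t for t in theta_guesses
            if np.mean((simulate_virtual_jobs(t, x_observed) - y_observed) ** 2) < 0.1]

print("posterior mean of theta ~", np.mean(accepted))   # close to 2.0
```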
In one example application of the resource prediction twin 420, a user through the design engineer 410 of
The automatic and intelligent job/task assignment engine 430 may be configured as a separate machine learning model (such as a regression model or a neural network) in order to automatically and adaptively determine the optimized resource assignment among the computing resources within the HPC cluster and provision various job/task queues.
The automatic job/task assignment engine 430 may include a robust machine learning and inference model plus intelligent policies to provide optimal schedules while dealing with uncertainties and inaccuracies inherent to other prediction models. The automatic job/task assignment engine 430 may go beyond using simple heuristics by further taking into account workloads and unpredicted uncertainties, and by keeping various policies updated automatically and efficiently.
In some implementations, the automatic and intelligent job/task assignment engine 430 may include component 710 for resource selection and optimization (e.g., stochastic resource optimization such as a Markov Decision Process (MDP)), component 712 for state and policy inference, and component 714 for performing deep reinforcement learning.
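As a non-limiting illustration of the reinforcement-learning idea, the sketch below runs tabular, single-step Q-learning over a toy cluster state and a small set of scheduling actions; the states, actions, and rewards are invented for illustration, and an actual engine 430 would learn over multi-step trajectories, a much larger state space, and typically a deep network rather than a table.

```python
import random

random.seed(0)

ACTIONS = ["assign_to_cpu_node", "assign_to_gpu_node", "keep_in_queue"]
STATES = ["gpu_free", "gpu_busy"]        # toy cluster states

# Hypothetical reward: benefit (or penalty) of taking an action in a state.
REWARD = {
    ("gpu_free", "assign_to_gpu_node"): 10, ("gpu_free", "assign_to_cpu_node"): 3,
    ("gpu_free", "keep_in_queue"): -1,      ("gpu_busy", "assign_to_gpu_node"): -5,
    ("gpu_busy", "assign_to_cpu_node"): 3,  ("gpu_busy", "keep_in_queue"): 1,
}

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, epsilon = 0.1, 0.2

# Q-learning on single-step episodes: observe a state, pick an action
# (epsilon-greedy), receive a reward, and update the action-value estimate.
for _ in range(5000):
    state = random.choice(STATES)
    action = (random.choice(ACTIONS) if random.random() < epsilon
              else max(ACTIONS, key=lambda a: Q[(state, a)]))
    Q[(state, action)] += alpha * (REWARD[(state, action)] - Q[(state, action)])

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)   # e.g. {'gpu_free': 'assign_to_gpu_node', 'gpu_busy': 'assign_to_cpu_node'}
```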
The output of the automatic and intelligent job/task assignment engine 430 may include action items 750. An action item with respect to the submitted user job/task may be automatically triggered and executed by the automatic and intelligent job/task assignment engine 430. The action items 750 may be directed to job/task assignment and/or queue provisioning, including, for example, an update of the job queue, an update of the HPC cluster information, and an update of HPC and various other states and scheduling/assignment policies.
In some implementations, the automatic and intelligent job/task assignment engine 430 may include a machine learning model that maps the various input data, including the resource prediction generated by the resource prediction twin 420 and other data in 740, or their transformed forms, to one or more action items among a set of possible action items. The set of action items may be predefined action classes and may be expanded during deep reinforcement learning of the intelligent job/task assignment engine 430. Such mapping may be performed according to a trained policy as shown in component 712 applied to input data associated with input job/task 800 and metrics data 740.
The intelligent job/task assignment function may further consider unpredicted uncertainties 840 and may be based on the assignment policies 850. The uncertainties, for example, may include unpredicted hardware failures and inaccurate resource predictions by the resource prediction twin. Historical occurrences of these unpredicted uncertain events may be considered by the assignment function, in addition to the machine-job matching scores 820 and the assignment policy 850, in generating the predicted action among the set of predefined scheduling actions 880 by a resource assignment function 830. The predefined set of scheduling actions includes, but is not limited to, assigning the job/task to a computing resource and keeping the job/task in the task queue.
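The following sketch illustrates, with invented weights and scores, how a resource assignment function might combine machine-job matching scores, historical uncertainty (e.g., per-node failure rates), and a simple policy threshold into one of the predefined scheduling actions; it is an assumption-laden simplification of the function 830 described above.

```python
# Illustrative sketch (all weights and score values are assumptions) of how a
# resource assignment function might combine machine-job matching scores, an
# assignment policy, and historical uncertainty into one scheduling action.
SCHEDULING_ACTIONS = ["assign_to_node", "keep_in_queue"]

def assignment_function(matching_scores: dict, failure_rates: dict,
                        policy_threshold: float = 0.5) -> tuple:
    """Pick the best node, discounted by its historical failure rate,
    or keep the job queued if no node clears the policy threshold."""
    adjusted = {node: score * (1.0 - failure_rates.get(node, 0.0))
                for node, score in matching_scores.items()}
    best_node, best_score = max(adjusted.items(), key=lambda kv: kv[1])
    if best_score >= policy_threshold:
        return ("assign_to_node", best_node)
    return ("keep_in_queue", None)

print(assignment_function({"gpu-node-1": 0.9, "cpu-node-7": 0.6},
                          {"gpu-node-1": 0.5}))   # -> ('assign_to_node', 'cpu-node-7')
```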
The assignment policy 850, as one of the input items considered by the resource assignment function 830, may be generated, for example, by a trained machine learning model as shown by 860 of
The resource prediction twin machine learning model 420 described above for outputting predictions and the automatic and intelligent assignment/workload provisioning engine 430 may provide metrics of interest to the user for their submitted jobs/tasks. In addition, the twin machine learning models are able to handle uncertainty and automatically adapt to constantly evolving system queues and the submitted jobs via inference and reinforcement learning. Further, because the resource prediction twin machine learning model 420 provides the prediction of various metrics (e.g., accuracy, latency, memory consumption for a given model), engineers (e.g., machine learning engineers, deployment engineers) may use these models to help reduce the time and overall cost of operations. As such, the resource prediction twin machine learning model 420 may be used as a platform to explore various scenarios under which these machine learning models can be retrained and deployed. For example, the twin machine learning models and workload provisioning system may be utilized in intelligent and system-aware job assignments on high-performance computing clusters.
The resource prediction twin 420 and the intelligent assignment/workload provisioning engine 430 may be implemented by circuitry. The circuitry may further include or access instructions for execution by the circuitry. The instructions may be stored in a tangible storage medium that is other than a transitory signal, such as a flash memory, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM); or on a magnetic or optical disc, such as a Compact Disc Read Only Memory (CDROM), Hard Disk Drive (HDD), or other magnetic or optical disk; or in or on another machine-readable medium. A product, such as a computer program product, may include a storage medium and instructions stored in or on the medium, and the instructions when executed by the circuitry in a device may cause the device to implement any of the processing described above or illustrated in the drawings.
The implementations may be distributed as circuitry among multiple system components, such as among multiple processors and memories, optionally including multiple distributed processing systems. Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented in many different ways, including as data structures such as linked lists, hash tables, arrays, records, objects, or implicit storage mechanisms. Programs may be parts (e.g., subroutines) of a single program, separate programs, distributed across several memories and processors, or implemented in many different ways, such as in a library, such as a shared library (e.g., a Dynamic Link Library (DLL)). The DLL, for example, may store instructions that perform any of the processing described above or illustrated in the drawings, when executed by the circuitry.
While the particular disclosure has been described with reference to illustrative embodiments, this description is not meant to be limiting. The attached appendix is part of this disclosure, and is incorporated herein as further illustrative embodiments. Various modifications of the illustrative embodiments and additional embodiments of the disclosure will be apparent to one of ordinary skill in the art from this description. Those skilled in the art will readily recognize that these and various other modifications can be made to the exemplary embodiments, illustrated and described herein, without departing from the spirit and scope of the present disclosure. It is therefore contemplated that the appended claims will cover any such modifications and alternate embodiments. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.
This application is based on and claims priority to the U.S. Provisional Application No. 63/051,167 filed on Jul. 13, 2020, which is herein incorporated by reference in its entirety.