An artificial intelligence (AI) inferencing platform may implement various trained AI models, used to generate outputs in response to input prompts provided by users. Such platforms may use various techniques in an attempt to efficiently allocate different user AI workloads to different processing units within the platform, to thereby optimize the use of processing units, reduce latency, and balance the load across the network.
A method for artificial intelligence (AI) inferencing workload allocation includes, at a computing device of a distributed AI inferencing platform, receiving an estimated prompt load and an estimated generation load of an AI inferencing workload to be fulfilled by a processing unit of a computing node of the distributed AI inferencing platform. Based at least in part on the estimated prompt load and the estimated generation load, an inference unit (IU) processing load is estimated, the IU processing load to be applied to the processing unit while fulfilling the AI inferencing workload. Fractional processing capacity of the processing unit is allocated for fulfilling the AI inferencing workload based at least in part on the IU processing load.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
A distributed artificial intelligence (AI) inferencing platform can be used to provide different AI models for vision, speech, language, decision-making, and/or other suitable modalities, and make such models available to users for inferencing. In some examples, the users are customers of the distributed AI inferencing platform. Such a platform may allow users to create their own AI models, which can be hosted on the platform and used to perform inferencing on user-provided data. Additionally, or alternatively, the platform may host various pre-trained models. Models accessible through the inferencing platform may include Large Language Models (LLMs), speech models, image models, various multimodal models, traditional machine learning models, and/or any other suitable machine-learning (ML) or AI models.
The models are executed by suitable computer logic hardware, such as via a plurality of different processing units distributed between a plurality of computing nodes of the distributed AI inferencing platform. The present disclosure primarily describes the processing units used to execute the AI models and perform inferencing on user-provided input prompts as being graphics processing units (GPUs). In other words, in some examples, a processing unit is a graphics processing unit. It will be understood, however, that this is not limiting; any suitable computer logic hardware may be used. In some examples, the computing devices used to instantiate the distributed AI inferencing platform are implemented as computing system 1200 described below with respect to
As used herein, a “workload” refers to a set of one or more input prompts provided to the distributed AI inferencing platform for inferencing over time. It can be challenging to coordinate resource sharing between different users—e.g., to distribute different user workloads between the different processing units in a manner that provides stable and consistent performance for each user. In one approach, different users may purchase or otherwise acquire a desired number of Provisioned Throughput Units (PTUs), which represent inferencing capacity that the user can use to fulfill their inferencing workloads. As an example, some users who desire a predictable and dedicated inferencing capacity can purchase capacity equivalent to a single GPU or GPU batch. However, once issued, this capacity typically cannot be used by other users, which can contribute to resource stranding and potential lost revenue for the owner of the distributed AI inferencing platform.
For users that do not need a dedicated capacity and/or are more price conscious, another option includes providing such users with access to a shared pool with a rate limiter in place. However, some users end up exhausting their token processing limit, and performance can be impacted during periods of high utilization—e.g., when the available compute resources of the pool are saturated by other user workloads. This can be referred to as a “noisy neighbor” phenomenon, where one user's latency is negatively affected by the fact that utilization of the shared pool by other users is currently high.
As such, the present disclosure describes techniques for AI model workload allocation that can provide fractional reserved processing capacity. In other words, users may be assigned fractional capacity allocations that represent a fraction of the total capacity of a single processing unit. This may enable a distributed AI inferencing platform to provide consistent latency and throughput performance for user workloads, particularly for workloads with stable characteristics such as prompt and completion size and number of concurrent requests.
However, providing such consistent performance can pose various challenges. Users may bring a wide variety of different workloads to the distributed AI inferencing platform. These workloads can be classified as different types including, for example, generation-heavy, prompt-heavy, and balanced workloads. These different combinations of prompt and generation characteristics lead to varying latencies. Some prompts can be similar in nature, allowing the model to internally optimize token generation, while other workloads bring in very disparate prompts. Users can request multiple completions for the same prompt, and/or provide multiple prompt arrays for the same request. Both impact the total processing load associated with fulfilling the workload. Some users may send datasets requesting a streaming response focused on low time-to-first-token latency, while other users may provide non-streaming workloads that emphasize throughput and overall lower execution time. Additional factors that can affect AI model performance include model fine-tuning functionality (e.g., providing users with the capability to fine-tune models for a specific context), changes in hardware performance over time (e.g., due to hardware updates), and/or compliance with data privacy and responsible AI standards.
Accordingly, the present disclosure describes techniques for predicting the processing unit capacity (e.g., GPU capacity) that will be used for fulfilling a given user workload. In general, input prompts of a user workload are formatted as some number of input tokens, where each input token corresponds to a word, series of characters, single character, or other subdivision of the input prompt. The input tokens expressing an input prompt are input to an AI model, which generates a response to the input prompt in the form of output tokens. Thus, the processing load used to fulfill a given workload is proportional both to the number of input tokens provided to the AI model and to the number of output tokens generated by the model.
As such, the processing load to be used for a given workload may be predicted based on a sample workload with a certain distribution of input and output tokens, allowing the system to achieve predictable performance for similar user workloads. In general, the system may estimate the prompt load (e.g., processing load associated with interpreting input tokens) and generation load (e.g., processing load associated with generating output tokens) for a given workload. This may be used to estimate an inference unit (IU) processing load, representative of the overall processing load that will be placed on the processing unit while fulfilling the user's workload. Based on the IU processing load, fractional processing capacity of a processing unit may be allocated for fulfilling the workload.
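By way of a non-limiting illustration, the overall flow just described can be sketched in Python as follows. The per-token cost coefficients, the per-minute framing, and the helper names are assumptions introduced purely for illustration; the disclosure does not prescribe any particular implementation.

```python
from dataclasses import dataclass


@dataclass
class WorkloadEstimate:
    """Token statistics describing an expected AI inferencing workload."""
    input_tokens_per_request: float   # average prompt size, in tokens
    output_tokens_per_request: float  # average completion size, in tokens
    requests_per_minute: float        # expected request rate over time


def estimate_prompt_load(w: WorkloadEstimate) -> float:
    # Prompt load is taken to be proportional to input tokens processed per minute.
    return w.input_tokens_per_request * w.requests_per_minute


def estimate_generation_load(w: WorkloadEstimate) -> float:
    # Generation load is taken to be proportional to output tokens generated per minute.
    return w.output_tokens_per_request * w.requests_per_minute


def estimate_iu_load(prompt_load: float, generation_load: float,
                     prompt_coeff: float = 1.0, gen_coeff: float = 4.0) -> float:
    # Hypothetical linear combination; real coefficients would come from a fitted
    # statistical model such as the regression sketched later in this section.
    return prompt_coeff * prompt_load + gen_coeff * generation_load


def allocate_fraction(iu_load: float, unit_iu_capacity: float) -> float:
    # Fraction of one processing unit reserved for the workload, capped at 1.0.
    return min(iu_load / unit_iu_capacity, 1.0)
```

In this sketch, the fraction of a processing unit reserved for a workload grows in proportion to its estimated IU processing load, mirroring the allocation step described above.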
In some examples, the techniques described herein can provide one or more of: a capability to predict fractional GPU capacity for a given workload to achieve service-level agreement (SLA) requirements; a capability to abstract physical hardware resources and networking topology to determine performance characteristics of AI models; an ability to understand the impact of different workloads on these AI models without conducting extensive benchmarking or proof of concepts; an ability to model additional sensitivity analysis by adding information (such as expected cache hit rate and/or prompt tuning) that influences the token generation rate; and/or an intelligent system that can recommend an ideal model and GPU cluster configuration to run a workload without human intervention.
As shown, the computing device is a component of a distributed AI inferencing platform 108 that is implemented by a plurality of different computing devices. In general, computing device 100, as well as the other computing devices described herein, has any suitable capabilities, hardware configuration, and form factor. Any or all of the computing devices described herein may in some cases be implemented as computing system 1200 described below with respect to
As used herein, an AI model is “implemented” by a computing device when parameters of the AI model are stored by, or otherwise accessible to, the computing device, such that the AI model can be executed by the processor of the computing device. This may include loading model parameters into random access memory (RAM) of the computing device, memory of a graphics processing unit (GPU), memory of a central processing unit (CPU), and/or other suitable volatile or non-volatile memory. The same computing device may implement any suitable number and variety of different AI models. In some examples, one or more of the AI models implemented by the computing device are instantiated via different virtual machines or containers hosted on the computing device. For instance, in one embodiment, each different AI model is instantiated in a different virtual machine or container hosted by the computing device.
In
The inference workload is transmitted to the distributed AI inferencing platform via a computer network 120. This takes the form of any suitable local or wide-area computer network, such as the Internet. As shown, the distributed AI inferencing platform additionally receives a plurality of other inference workloads 122A-C, which originate from other computing devices. It will be understood that the distributed AI inferencing platform may receive any suitable number of inference requests from any suitable number of different client computing devices, which may correspond to one or more different users of the distributed AI inferencing platform. For instance, in one scenario, the distributed AI inferencing platform is used by thousands or millions of different customers, each of whom pays for access to the hardware resources of the distributed AI inferencing platform, and each of whom provides inference workloads to the platform for fulfillment.
As discussed above, in some examples, the distributed AI inferencing platform is implemented by a plurality of different computing devices (e.g., server computers) working cooperatively. Different computing devices may, for instance, host different AI models, and/or have different hardware capabilities suitable for fulfilling different types of workloads. In
Thus, as discussed above, the distributed AI inferencing platform may allocate received inference workloads between different computing nodes and processing units according to any suitable criteria. In
As will be described in more detail below, the estimated prompt load and estimated generation load are “received” in any suitable way, from any suitable source. In some examples, “receiving” the estimated prompt load and estimated generation load may include calculating either or both of the estimated prompt load and estimated generation load based on characteristics of the AI inferencing workload, and/or based on characteristics of a sample workload. Additionally, or alternatively, the estimated prompt load and/or estimated generation load may be specified by the user supplying the AI inferencing workload.
In any case, based at least in part on the estimated prompt load and the estimated generation load, the workload allocation computing device estimates an inference unit (IU) processing load 136 to be applied to the processing unit while fulfilling the AI inferencing workload. As will be described in more detail below, the IU processing load may be estimated in various suitable ways, based on any suitable input data. As one non-limiting example, the IU processing load may be estimated using a statistical model.
In some examples, the estimated IU processing load may be output for additional processing and/or review. For instance, in some examples, the workload allocation computing device may transmit an indication of the IU processing load for display in a graphical user interface (GUI). One example GUI will be described below with respect to
Once the IU processing load is estimated, the computing device allocates fractional processing capacity of the processing unit for fulfilling the AI inferencing workload based at least in part on the IU processing load. In
It will be understood that the output data may be “output” in various suitable ways depending on the implementation. In some embodiments, outputting the output data includes passing the output data to a downstream application (e.g., for decoding), transmitting the output data to another computing device (e.g., to a client computing device, to another computing device of a distributed AI inferencing platform), writing the output data to a data file, storing the output data in non-volatile storage of the computing device, and/or storing the output data in an external storage device communicatively coupled with the computing device.
In
Estimation of an estimated prompt load and an estimated generation load (e.g., such as prompt load 132 and generation load 134 shown in
The sample workload may have any suitable source. In some examples, the sample workload is provided by the same user that also provided the AI inferencing workload for which fractional processing capacity is being allocated. For instance, the sample workload may include curated input prompts and corresponding output data that the user believes will be representative of future AI inferencing workloads that they expect to provide to the distributed AI inferencing platform. Additionally, or alternatively, the sample workload may include previous input prompts provided to the distributed AI inferencing platform in a prior inferencing request associated with the same user as the AI inferencing workload. In other words, the sample workload may include inference workloads previously fulfilled by the distributed AI inferencing platform for the user—e.g., prior to the user switching to a fractional allocation model.
Characteristics of the sample workload may be used to calculate the estimated prompt load and/or generation load in any suitable way. In the example of
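As one hedged illustration of how such characteristics could be aggregated, the following sketch averages the measured input and output token counts of the requests in a sample workload. The record fields (input_tokens, output_tokens) are assumed names rather than a prescribed schema.

```python
def summarize_sample_workload(sample):
    """Aggregate token statistics from a sample workload.

    `sample` is assumed to be an iterable of records, each with measured
    `input_tokens` and `output_tokens` counts for a prior request.
    """
    records = list(sample)
    total_in = sum(r["input_tokens"] for r in records)
    total_out = sum(r["output_tokens"] for r in records)
    n = max(len(records), 1)
    return {
        "avg_input_tokens": total_in / n,    # drives the estimated prompt load
        "avg_output_tokens": total_out / n,  # drives the estimated generation load
        "request_count": len(records),
    }


# Example usage with a small synthetic sample:
sample = [{"input_tokens": 500, "output_tokens": 200},
          {"input_tokens": 2000, "output_tokens": 100}]
print(summarize_sample_workload(sample))
```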
Estimation of an IU processing load is schematically illustrated with respect to
The IU processing load may be estimated in any suitable way depending on the implementation, using any suitable type of estimation model. As one non-limiting example, the IU processing load may be estimated using a statistical model. Any suitable statistical model may be used. In some examples, the statistical model is a linear regression model. However, it will be understood that this need not only be done via linear regression; rather, as other examples, suitable boosted tree models, mathematical models, and/or trained models (e.g., a trained machine learning model that outputs an IU load estimate based on prompt and generation load) may be used to estimate the IU load for a given workload.
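A minimal sketch of one such statistical model follows, using an ordinary least-squares fit in NumPy. The benchmark observations shown are synthetic placeholders; in practice they would come from simulations or measurements at varying prompt and generation sizes.

```python
import numpy as np

# Synthetic observations: (prompt load, generation load) -> measured IU load.
prompt_load = np.array([100.0, 500.0, 1000.0, 2000.0, 4000.0])
generation_load = np.array([50.0, 100.0, 400.0, 800.0, 1200.0])
measured_iu = np.array([0.8, 2.1, 5.9, 11.3, 19.8])

# Design matrix with an intercept column to absorb the error/offset term.
X = np.column_stack([prompt_load, generation_load, np.ones_like(prompt_load)])
coeffs, *_ = np.linalg.lstsq(X, measured_iu, rcond=None)


def predict_iu_load(p_load: float, g_load: float) -> float:
    # IU load is approximately a * prompt load + b * generation load + c.
    a, b, c = coeffs
    return a * p_load + b * g_load + c


print(predict_iu_load(1500.0, 600.0))
```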
In some cases, the estimation model used to derive the IU processing load may be generated based on a current-pass token quantity and an input prompt index. In
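One hedged reading of this approach is sketched below: a fitted function (here a placeholder linear form) maps the current-pass token quantity and the input prompt index to an estimated processing time per input token, and the IU processing load is taken as proportional to the total time spent processing the prompt. The coefficient values are illustrative assumptions only.

```python
def per_token_seconds(current_pass_tokens: int, prompt_index: int,
                      base=1e-4, per_pass_token=2e-8, per_index=5e-7) -> float:
    """Hypothetical fitted model: estimated seconds to process each input token,
    given the current-pass token quantity and the input prompt index."""
    return base + per_pass_token * current_pass_tokens + per_index * prompt_index


def prompt_pass_iu_load(current_pass_tokens: int, prompt_index: int,
                        iu_per_second: float = 1.0) -> float:
    # The IU processing load is taken as proportional to the total time spent
    # processing the input tokens of the sample prompt in this pass.
    total_seconds = per_token_seconds(current_pass_tokens, prompt_index) * current_pass_tokens
    return iu_per_second * total_seconds


print(prompt_pass_iu_load(current_pass_tokens=2000, prompt_index=5))
```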
The present disclosure has thus far primarily focused on a single inference workload, for which fractional capacity of one processing unit is allocated. However, as discussed above, the distributed AI inferencing platform may be implemented by any suitable number of different computing nodes (e.g., thousands of different server computers), each of which may include any suitable number of different processing units (e.g., GPU batches). Thus, the techniques described herein may be used to allocate a wide variety of different workloads to a plurality of different processing units, which may themselves be distributed between a plurality of different computing nodes of the distributed AI inferencing platform.
This process, referred to as “batch allocation,” is schematically illustrated with respect to
In this example scenario, prompts batch 400A includes a set of ten prompts that are 500 tokens long and cause generation of 200 output tokens each. Each GPU batch in this example has a finite capacity of generating 3000 output tokens, and thus GPU batch 404A can handle the ten prompts of batch 400A (e.g., accounting for 2000 output tokens). Prompts batch 400B has a set of 20 prompts that are 2000 tokens long and cause generation of 100 output tokens each (this may be referred to as a “prompt-heavy” dataset). Half of these are assigned to GPU batch 404A (e.g., accounting for 1000 output tokens) to balance its token capacity, and the remainder of prompts batch 400B goes to GPU batch 404B (accounting for another 1000 output tokens). Prompts batch 400C has a set of 40 prompts that are 100 tokens long and cause generation of 500 output tokens each (this may be referred to as a “generation-heavy” dataset). Four of these input prompts (e.g., accounting for 2000 output tokens) are assigned to GPU batch 404B to balance its capacity, and another six input prompts (e.g., accounting for 3000 output tokens) are assigned to GPU batch 404C. Any unassigned input prompts may be rejected with an error indicating that the system is at capacity and cannot process additional prompts.
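The worked scenario above can be reproduced with a simple greedy, first-fit allocator, sketched below under the stated assumption that each GPU batch has a generation capacity of 3000 output tokens. The function and field names are illustrative only.

```python
def allocate_prompts(prompt_groups, batch_capacity=3000, num_batches=3):
    """Greedily assign prompts to GPU batches by remaining output-token capacity.

    `prompt_groups` is a list of (count, output_tokens_per_prompt) tuples
    matching the worked example above.
    """
    remaining = [batch_capacity] * num_batches
    assignments = []   # (group_index, batch_index) for each placed prompt
    rejected = 0
    for g, (count, out_tokens) in enumerate(prompt_groups):
        for _ in range(count):
            for b in range(num_batches):
                if remaining[b] >= out_tokens:
                    remaining[b] -= out_tokens
                    assignments.append((g, b))
                    break
            else:
                rejected += 1  # system at capacity; prompt rejected with an error
    return assignments, rejected


assignments, rejected = allocate_prompts([(10, 200), (20, 100), (40, 500)])
print(rejected)  # 30 prompts from the generation-heavy batch cannot be placed
```

Running the sketch reproduces the assignments described above: the first GPU batch takes the ten 200-token generations plus half of the prompt-heavy batch, the second takes the remaining prompt-heavy prompts plus four generation-heavy prompts, the third takes six more, and the remaining thirty prompts are rejected.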
Various metrics may be used to illustrate performance characteristics of the above system.
Time-to-first-token (TTFT): This measures the time to generate the first token that is sent to the user. This can be particularly relevant for streaming requests, where the user expects the AI model to continuously send tokens as they are generated. As a corollary, time-to-last-token (TTLT) is the time to generate the last token in the response.
Time-between-tokens (TBT): This measures the time between tokens during the token generation process. With rising load, TBT degrades, and generation rate reduces at high load. TBT is relevant for both streaming and non-streaming requests.
Utilization: Once all the GPU batches are at full token generation capacity with no more new batches possible (e.g., due to memory constraints), the system is at full utilization and any additional load may be rejected. In some examples, it is possible for the user to send a retry request once the load reduces to get a subsequent response via their client.
Latency variation: As load increases, and the GPU batches approach their generation capacity, TBT will no longer stay in a steady state. This is shown in
Prompt load: In some examples, such as cases where the input prompt has a relatively large number of input tokens, the index of the prompt can be relevant. For instance, TTFT when the prompt token size is 200 versus 20,000 can be vastly different. In other words, TTFT is proportional to the prompt load, which in turn is proportional to the number of input tokens in the AI inferencing workload, as discussed above. In this manner, for “prompt-heavy” workloads (e.g., those with a relatively large number of input tokens for inferencing), prompt load is higher, and TTFT is higher. In some examples, prompts are read sequentially but generation is parallelized up to the number of threads that can be spawned, which in turn is a function of available memory in the hardware. The prompt load for a given workload (e.g., a sample user workload) can be estimated by simulating different scenarios where different numbers of input tokens are provided to the same model, with the same output and batch sizes. Additionally, or alternatively, the prompt load may be derived from a user indication of the predicted number of input tokens that will be included in a typical input prompt.
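A sweep of this kind might be carried out as in the following sketch; run_benchmark stands in for whatever benchmarking or simulation harness is available and is not an API defined by this disclosure.

```python
def sweep_prompt_sizes(run_benchmark, model_id,
                       prompt_sizes=(200, 1000, 5000, 20000),
                       output_tokens=100, batch_size=1):
    """Run the same model/output/batch configuration across prompt sizes and
    record time-to-first-token (TTFT) to characterize the prompt load."""
    results = {}
    for n_input in prompt_sizes:
        results[n_input] = run_benchmark(model=model_id, input_tokens=n_input,
                                         output_tokens=output_tokens,
                                         batch_size=batch_size)
    return results


# Stand-in benchmark for illustration only: TTFT grows with prompt size.
def fake_benchmark(model, input_tokens, output_tokens, batch_size):
    return 0.05 + 1.5e-5 * input_tokens


print(sweep_prompt_sizes(fake_benchmark, "example-model"))
```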
Generation load: The TBT for a given workload increases with an increasing number of generated tokens. In other words, TBT is proportional to the generation load, which is the processing load applied to the GPUs during generation of output tokens, as discussed above. This may arise from the nature of the workload itself and/or the number of concurrent requests. TBT can be non-linear, especially at higher utilization of the GPUs (e.g., as is shown in
IU processing load: Inference requests that cause generation of a relatively large number of tokens typically have a higher generation load. This can be exacerbated when the prompt load is also relatively high, causing a significant overall IU processing load. In other words, the IU load to fulfill a given inference request is based at least in part on the prompt token length, TTFT, TBT, and/or the expected number of generated tokens. To estimate the IU load, a set of simulations may be run with varying prompt and generation sizes at various utilization capacities (e.g., request concurrency impacts the load placed on the GPUs), while measuring the latency impact. In this manner, for a given AI model:
IU Load ∝ (Prompt load + Generation load + error)
This may be used to estimate the running inference-units-per-minute (IUPM) load of incoming requests that are going into the GPU.
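Interpreting the relationship above, a running IU-per-minute figure could be tracked as in the following sketch. The coefficient values, the error margin, and the one-minute sliding window are illustrative assumptions rather than platform constants.

```python
import time
from collections import deque


class IUPMeter:
    """Track the running inference-unit load per minute for a processing unit."""

    def __init__(self, prompt_coeff=1.0, gen_coeff=4.0, error_margin=0.05):
        self.prompt_coeff = prompt_coeff  # placeholder weight for prompt load
        self.gen_coeff = gen_coeff        # placeholder weight for generation load
        self.error_margin = error_margin  # headroom for the error term
        self.window = deque()             # (timestamp, iu_cost) of recent requests

    def record_request(self, input_tokens: int, output_tokens: int) -> float:
        iu = (self.prompt_coeff * input_tokens
              + self.gen_coeff * output_tokens) * (1.0 + self.error_margin)
        self.window.append((time.monotonic(), iu))
        return iu

    def iu_per_minute(self) -> float:
        cutoff = time.monotonic() - 60.0
        while self.window and self.window[0][0] < cutoff:
            self.window.popleft()
        return sum(iu for _, iu in self.window)


meter = IUPMeter()
meter.record_request(input_tokens=500, output_tokens=200)
print(meter.iu_per_minute())
```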
Stable performance: Using the above relationship, it is possible to estimate the IU processing load that will be applied to GPUs of the distributed AI inferencing platform while fulfilling the user's AI inferencing workload. In some examples, a maximum batch size may be enforced. In some examples, the predicted IU load and maximum batch size may be set such that the stability for a given workload is improved, while latency is reduced. For instance,
Fractional capacity: The IU processing load estimation function can be used to provide an IU processing load calculator, enabling allocation of fractional GPU capacity as a function of incoming prompt load and generation load. This means that, given a user workload (e.g., prompt-heavy, generation-heavy, or balanced), it is possible to estimate the associated IU processing load. As such, GPU capacity can be provisioned as a logical unit, which may be referred to as a PTUv2 (Provisioned-Managed Throughput Unit), also referred to as a “PTU” for simplicity.
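A fractional-capacity calculation in the spirit of the PTU concept might look like the following sketch; the per-GPU-batch IU capacity and the number of PTUs per GPU batch are assumed placeholder values, not published platform figures.

```python
import math


def ptus_for_workload(estimated_iu_load: float,
                      gpu_iu_capacity: float = 100.0,
                      ptus_per_gpu: int = 10) -> int:
    """Convert an estimated IU processing load into a whole number of PTUs.

    Each PTU here represents 1/ptus_per_gpu of one GPU batch's IU capacity;
    both figures are illustrative assumptions.
    """
    fraction_of_gpu = estimated_iu_load / gpu_iu_capacity
    return max(1, math.ceil(fraction_of_gpu * ptus_per_gpu))


# Example: a workload estimated at 37 IU on a 100-IU GPU batch maps to 4 PTUs.
print(ptus_for_workload(37.0))
```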
Running at a defined Provisioned-Managed Throughput can help to reduce the cost for a user, while improving the stability and reducing the latency with which the user's workloads are fulfilled. In some examples, if a user requires additional GPU capacity, they can then purchase additional PTUs as fractional GPU capacity. Notably, the remaining GPU capacity is available to fulfill other user workloads. In this manner, different workloads can be isolated from one another, which improves stability and reduces the impact that one user workload may have on other, concurrently fulfilled workloads. This can reduce the impact of “noisy neighbors”—e.g., scenarios where one user's workload is negatively impacted by the fact that resources of a shared pool are consumed by other user workloads.
In some examples, provisioning fractional GPU capacity in this manner can reduce GPU idle time, effectively increasing GPU utilization.
At 1102, method 1100 includes receiving an estimated prompt load and an estimated generation load of an AI inferencing workload. As discussed above, “receiving” the estimated prompt load and/or estimated generation load may include receiving indications of these values from a user, and/or calculating these values based on characteristics of the input workload to be fulfilled. For instance, as described above with respect to
At 1104, method 1100 includes estimating an IU processing load to be applied to a processing unit while fulfilling the AI inferencing workload. As discussed above with respect to
At 1106, method 1100 includes allocating fractional processing capacity of a processing unit based at least in part on the estimated IU processing load. In this manner, fractional capacity of the same processing unit can be used to concurrently fulfill multiple different user inferencing workloads.
To summarize, the techniques described herein can provide various technical improvements over other approaches to allocating processing capacity for inference requests. For instance, some approaches use a first-in-first-out (FIFO) queue and are memory bound. In such cases, regardless of the size of the incoming request, the system allocates the workload to a GPU batch based on the order in which it was received. By contrast, in the approach described herein, the system estimates the compute cost in IUs to process the request, based at least in part on the estimated prompt load and the estimated generation load. In some examples, this need not abide by a simple FIFO order, but rather can accept other requests in the incoming queue that may have lower IU load while rejecting the requests with higher loads.
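The contrast with strict FIFO admission can be illustrated with the sketch below, in which queued requests are admitted against a remaining IU budget so that a heavy request does not block lighter ones. The queue fields and the capacity figure are illustrative assumptions.

```python
def admit_requests(queue, available_iu):
    """Admit queued requests by estimated IU cost rather than strict FIFO order.

    `queue` is a list of dicts with an 'estimated_iu' field. Requests that fit
    within the remaining IU budget are accepted even when an earlier, heavier
    request has to be rejected (or deferred for a later retry).
    """
    accepted, rejected = [], []
    for request in queue:              # still scanned in arrival order...
        if request["estimated_iu"] <= available_iu:
            accepted.append(request)
            available_iu -= request["estimated_iu"]
        else:
            rejected.append(request)   # ...but heavy requests do not block lighter ones
    return accepted, rejected


accepted, rejected = admit_requests(
    [{"id": 1, "estimated_iu": 80},
     {"id": 2, "estimated_iu": 15},
     {"id": 3, "estimated_iu": 20}],
    available_iu=40,
)
print([r["id"] for r in accepted])  # [2, 3]: the heavier request 1 is rejected
```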
The techniques described herein can improve performance stability and reduce latency by utilizing fractional GPU capacity in AI model inferencing. Users may reserve model inferencing capacity to run inferencing on high-volume and/or latency-sensitive workloads. The cost of an inferencing request may be estimated in IUs, which may be used to allocate fractional GPU capacity. For instance, IUs may be mapped to PTUs (e.g., in a one-to-one mapping, or other suitable mapping), which may be purchased by customers of the distributed AI inferencing platform. This can beneficially provide consistent latency and throughput for workloads with consistent characteristics, such as prompt token size, completion size, and concurrent requests. Notably, as discussed, fractional allocations of the same processing unit are isolated from one another, such that one user's workload is not negatively affected by another user's workload fulfilled on the same processing node (e.g., avoiding the “noisy neighbor” phenomenon described above). In other words, in some examples, the AI inferencing workload is fulfilled with an inferencing latency and an inferencing latency variability that is isolated from other AI inferencing workloads concurrently fulfilled by the processing unit. Once allocated, PTUs can be used by users to create AI model deployments on the platform. PTUs are logical constructs that are independent of the underlying hardware infrastructure and network topologies.
For Provisioned-Managed deployments, users can monitor the number of PTUs used at any given time. This deterministic approach may simplify capacity planning and deployment monitoring while providing consistent latency and lower costs for the users. This beneficially simplifies the application development process and enhances user experience by providing flexibility to scale up or down as needed without the constraints of a rigid quota system. On the platform side, it simplifies the complexities of managing AI models and meeting user needs.
The techniques described herein can additionally provide an ability to understand the impact of different workloads on different AI models, without conducting extensive benchmarking or proof of concepts. For instance, in some examples, estimating the IU processing load for inferencing requests supplied to a particular selected AI model may include estimating workload-related performance characteristics for the selected AI model. This may beneficially reduce GPU idle time and improve utilization. Fractional GPU capacity of the same GPU batch can be allocated to different users, where different user workloads are isolated from one another and do not cause significant performance impacts for one another. Additional sensitivity analysis may be modeled by adding information (e.g., expected cache hit rate, prompt customization) that influences the token generation rate.
In some examples, users may be provided with an IU calculator that can estimate the amount of GPU capacity (e.g., in PTUs) that will be consumed by the user's workloads. In some examples, the system may provide a recommendation module that monitors a user's resource usage over time, and then suggests a number of PTUs that should be provisioned by the user to run their workloads while reducing costs and reducing latency. As one example, allocating fractional processing capacity of the processing unit may include predicting that a financial cost to the user of fulfilling the AI inferencing workload using the fractional processing capacity is less than a financial cost to the user of fulfilling the AI inferencing workload under the shared pool allocation plan. For instance, in some examples, a “sample workload” used to estimate prompt load and/or generation load may include input prompts previously submitted by the user to the distributed AI inferencing platform, even if such prompts were submitted before the user took advantage of fractional GPU capacity allocation. For instance, in one example, the sample workload may include a prior inferencing request provided to the distributed AI inferencing platform under a shared pool allocation plan.
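One hedged way such a recommendation module could operate is to size the suggested PTU count to a high percentile of the user's observed IU usage, as in the sketch below; the capacity constants mirror the earlier placeholder values, and the percentile choice is an assumption.

```python
import math


def recommend_ptus(observed_iu_samples, gpu_iu_capacity=100.0,
                   ptus_per_gpu=10, percentile=0.95):
    """Suggest a PTU count covering a high percentile of observed IU usage.

    `observed_iu_samples` are per-interval IU loads measured for the user's
    past workloads under a prior (e.g., shared pool) allocation plan.
    """
    ordered = sorted(observed_iu_samples)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    target_iu = ordered[idx]
    return max(1, math.ceil(target_iu / gpu_iu_capacity * ptus_per_gpu))


# Example: historical per-minute IU loads observed for a user's workloads.
history = [22, 25, 31, 28, 35, 40, 38, 30, 27, 45]
print(recommend_ptus(history))  # sized to roughly the 95th percentile of usage
```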
In some embodiments, the methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as a computer-application program or service, an application-programming interface (API), a library, and/or other computer-program product.
Computing system 1200 includes a logic processor 1202, volatile memory 1204, and a non-volatile storage device 1206. Computing system 1200 may optionally include a display subsystem 1208, input subsystem 1210, communication subsystem 1212, and/or other components not shown in
Logic processor 1202 includes one or more physical devices configured to execute instructions. For example, the logic processor may be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.
The logic processor may include one or more physical processors configured to execute software instructions. Additionally or alternatively, the logic processor may include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. Processors of the logic processor 1202 may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the logic processor optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the logic processor may be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration. In such a case, it will be understood that these virtualized aspects may be run on different physical logic processors of various different machines.
Non-volatile storage device 1206 includes one or more physical devices configured to hold instructions executable by the logic processors to implement the methods and processes described herein. When such methods and processes are implemented, the state of non-volatile storage device 1206 may be transformed—e.g., to hold different data.
Non-volatile storage device 1206 may include physical devices that are removable and/or built in. Non-volatile storage device 1206 may include optical memory, semiconductor memory, and/or magnetic memory, or other mass storage device technology. Non-volatile storage device 1206 may include nonvolatile, dynamic, static, read/write, read-only, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. It will be appreciated that non-volatile storage device 1206 is configured to hold instructions even when power is cut to the non-volatile storage device 1206.
Volatile memory 1204 may include physical devices that include random access memory. Volatile memory 1204 is typically utilized by logic processor 1202 to temporarily store information during processing of software instructions. It will be appreciated that volatile memory 1204 typically does not continue to store instructions when power is cut to the volatile memory 1204.
Aspects of logic processor 1202, volatile memory 1204, and non-volatile storage device 1206 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.
The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 1200 typically implemented in software by a processor to perform a particular function using portions of volatile memory, which function involves transformative processing that specially configures the processor to perform the function. Thus, a module, program, or engine may be instantiated via logic processor 1202 executing instructions held by non-volatile storage device 1206, using portions of volatile memory 1204. It will be understood that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” may encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
When included, display subsystem 1208 may be used to present a visual representation of data held by non-volatile storage device 1206. The visual representation may take the form of a graphical user interface (GUI). As the herein described methods and processes change the data held by the non-volatile storage device, and thus transform the state of the non-volatile storage device, the state of display subsystem 1208 may likewise be transformed to visually represent changes in the underlying data. Display subsystem 1208 may include one or more display devices utilizing virtually any type of technology. Such display devices may be combined with logic processor 1202, volatile memory 1204, and/or non-volatile storage device 1206 in a shared enclosure, or such display devices may be peripheral display devices.
When included, input subsystem 1210 may comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, camera, or microphone.
When included, communication subsystem 1212 may be configured to communicatively couple various computing devices described herein with each other, and with other devices. Communication subsystem 1212 may include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem may be configured for communication via a wired or wireless local- or wide-area network, broadband cellular network, etc. In some embodiments, the communication subsystem may allow computing system 1200 to send and/or receive messages to and/or from other devices via a network such as the Internet.
In an example, a method for artificial intelligence (AI) inferencing workload allocation comprises: at a computing device of a distributed AI inferencing platform, receiving an estimated prompt load and an estimated generation load of an AI inferencing workload to be fulfilled by a processing unit of a computing node of the distributed AI inferencing platform; based at least in part on the estimated prompt load and the estimated generation load, estimating an inference unit (IU) processing load to be applied to the processing unit while fulfilling the AI inferencing workload; and allocating fractional processing capacity of the processing unit for fulfilling the AI inferencing workload based at least in part on the IU processing load. In this example or any other example, the estimated prompt load is estimated based at least in part on an input prompt token quantity of a sample workload associated with a same user as the AI inferencing workload. In this example or any other example, the estimated generation load is estimated based at least in part on a quantity of output tokens generated for the sample workload. In this example or any other example, the sample workload includes a previous input prompt provided to the distributed AI inferencing platform in a prior inferencing request associated with the same user as the AI inferencing workload. In this example or any other example, a current-pass token quantity and an input prompt index are input to a statistical model to estimate an amount of time used to process each input token of a sample input prompt, and wherein the IU processing load is proportional to the amount of time used to process each input token of the sample input prompt. In this example or any other example, the fractional processing capacity of the processing unit is allocated as a fractional capacity allocation, and wherein a size of the fractional capacity allocation relative to a total processing capacity of the processing unit is proportional to the IU processing load. In this example or any other example, the fractional processing capacity is allocated as a first fractional capacity allocation, and wherein a second fractional capacity allocation of the processing unit is allocated for concurrently fulfilling a second AI inferencing workload associated with a different user. In this example or any other example, the AI inferencing workload is fulfilled with an inferencing latency and an inferencing latency variability that is isolated from other AI inferencing workloads concurrently fulfilled by the processing unit. In this example or any other example, the method further comprises outputting an indication of an observed inferencing load used while fulfilling the AI inferencing workload. In this example or any other example, the AI inferencing workload is fulfilled by an AI model deployed in the distributed AI inferencing platform, and wherein the method further comprises outputting a deployment-level processing load summary indicating a deployment-level processing load used for fulfilling inferencing requests provided to the AI model. In this example or any other example, the AI inferencing workload is associated with a selected AI model, and wherein estimating the IU processing load includes estimating workload-related performance characteristics for the selected AI model. In this example or any other example, the method further comprises transmitting an indication of the IU processing load for display in a graphical user interface (GUI).
In an example, a computing device comprises: a processor; and a storage device holding instructions executable by the processor to: receive an estimated prompt load and an estimated generation load of an AI inferencing workload to be fulfilled by a processing unit of a computing node of a distributed AI inferencing platform; based at least in part on the estimated prompt load and the estimated generation load, estimate an inference unit (IU) processing load to be applied to the processing unit while fulfilling the AI inferencing workload; and allocate fractional processing capacity of the processing unit for fulfilling the AI inferencing workload based at least in part on the IU processing load. In this example or any other example, the estimated prompt load is estimated based at least in part on an input prompt token quantity of a sample workload associated with a same user as the AI inferencing workload. In this example or any other example, the estimated generation load is estimated based at least in part on a quantity of output tokens generated for the sample workload. In this example or any other example, the sample workload includes a previous input prompt provided to the distributed AI inferencing platform in a prior inferencing request associated with the same user as the AI inferencing workload. In this example or any other example, the fractional processing capacity is allocated as a first fractional capacity allocation, and wherein a second fractional capacity allocation of the processing unit is allocated for concurrently fulfilling a second AI inferencing workload associated with a different user. In this example or any other example, the instructions are further executable to transmit an indication of the IU processing load for display in a graphical user interface (GUI). In this example or any other example, the fractional processing capacity of the processing unit is allocated as a fractional capacity allocation, and wherein a size of the fractional capacity allocation relative to a total processing capacity of the processing unit is proportional to the IU processing load.
In an example, a method for artificial intelligence (AI) inferencing workload allocation comprises: at a computing device of a distributed artificial intelligence (AI) inferencing platform, estimating an estimated prompt load and an estimated generation load of an AI inferencing workload to be fulfilled by a processing unit of a computing node of the distributed AI inferencing platform, based at least in part on a sample workload associated with a same user as the AI inferencing workload; based at least in part on the estimated prompt load and the estimated generation load, estimating an inference unit (IU) processing load via a statistical model, the IU processing load to be applied to the processing unit while fulfilling the AI inferencing workload; and allocating fractional processing capacity of the processing unit for fulfilling the AI inferencing workload based at least in part on the IU processing load.
“And/or” as used herein is defined as the inclusive or (∨), as specified by the following truth table:
A | B | A and/or B
---|---|---
True | True | True
True | False | True
False | True | True
False | False | False
It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.
This application claims priority to U.S. Provisional Patent Application No. 63/598,293, filed Nov. 13, 2023, the entirety of which is hereby incorporated herein by reference for all purposes.