This disclosure relates generally to effective deployment of machine-learning applications and more particularly to deploying machine-learning applications in client-server architectures with a plurality of processing units.
Real-time machine-learning (“ML”) applications, unlike typical web applications, rely heavily on computationally complex processing with few (or no) external requests. Upon receiving a request, the application performs a sequence of processing-intensive routines such as payload parsing, validation, feature engineering, model prediction, feature importance calculation, and so forth, all of which generally do not depend on resource calls beyond the server process memory.
In one example, a machine-learning application can process a single request in about 65-75 milliseconds when executed on a single processing unit (e.g., a processor or central processing unit (CPU)). Assuming no system or resource contention, the application might be expected to maintain approximately 70 ms latency up to a throughput of about 14 requests per second (1000 ms/70 ms) with only 1 process executing the application, after which higher throughput would result in higher latencies due to connections waiting in queue. Further, additional processing units supporting additional worker processes may be expected to grow the achievable request rate linearly, such that 1 worker process should support up to 14 requests per second (rps), 2 worker processes should support up to 28 rps, and so forth.
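As an illustrative aside, the expected-throughput figures above can be reproduced with a short calculation; the 70 ms per-request processing time and the assumption of linear scaling with worker count are taken from the example and are not measured values.

```python
import math

PER_REQUEST_MS = 70.0  # assumed single-request processing time from the example above


def expected_max_rps(num_workers: int, per_request_ms: float = PER_REQUEST_MS) -> int:
    """Requests per second sustainable before queuing, assuming linear scaling."""
    per_worker_rps = math.floor(1000.0 / per_request_ms)  # ~14 rps per worker
    return num_workers * per_worker_rps


for workers in (1, 2, 64):
    print(f"{workers} worker process(es): up to ~{expected_max_rps(workers)} rps")
# 1 worker process(es): up to ~14 rps
# 2 worker process(es): up to ~28 rps
# 64 worker process(es): up to ~896 rps
```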
In practice, these worker processes may implement ML applications in various technologies, such as Python, using an HTTP server framework to deploy the application with multiple processes executing on multiple processors. However, these deployments often do not achieve the expected throughput. In certain experiments (with 64 processing units), the performance of the application on standard server implementations starts to degrade above a received request rate of 6 rps and exceeds 1 s latency above a throughput of 17 rps, regardless of how many worker processes are deployed for the application. This unexpected behavior yields low throughput for ML applications deployed in client-server architectures and may lead to significant overprovisioning of servers or reduced efficiency of deployed systems. As real-time use cases can experience a wide range of request rates, from hundreds per day for larger adjudication tasks to millions per day for smaller transaction scoring, it is essential to improve the effective throughput of an ML application on server systems.
In many cases, the inefficient performance of these ML processes may be attributable to contention for processing unit time. That is, the worker processes of the ML application may be migrated from one processing unit of the server system to another as the operating system schedules execution of the various active processes. As a result, a worker process may incur significant overhead due to the transfer of the application's state and relevant data to another processing unit, along with other associated overhead. As additional worker processes are added, they too may be migrated to different processing units during execution. This issue particularly affects ML applications (and not other web server applications) because these applications typically require a relatively high amount of processing with few or no external calls. That is, these applications both require substantial processing and are disproportionately affected by inefficient use of processing units.
To address this problem, rather than permitting the worker processes to execute on any of the processing units (i.e., the plurality of processing units present on the server), a subset of the processing units is designated as eligible for processing each of the worker processes. For example, each worker process may be designated a single processing unit for processing that worker process. Further, the subsets of processing units may be mutually exclusive, such that no worker process shares a processing unit with another. While the operating system may still schedule other processes for execution on the processing units eligible for processing the worker processes, this approach reduces overhead associated with instruction and data migration of the worker processes to different processing units. In certain experimental results discussed below, specifying a subset of eligible processing units for the worker processes of the ML application increases throughput by five times relative to a baseline in which the worker processes may be processed by any processing unit. This approach thus provides a simple way to increase throughput while also reducing compute costs on a single machine.
The figures depict various embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.
In general, the application provided by the ML server system 100 may implement a machine-learning model with parameters trained on a set of training data, and the model may typically include thousands, millions, or billions of parameters applied to process an input into a resulting output. Various additional analyses before and after application of the model may also be applied in different embodiments. For example, the input may be evaluated to determine whether it is suitable for application to the trained ML model (e.g., whether the input is of a similar distribution to the data used to train the model), and the output may be analyzed to determine feature importance according to the model application or to generate an individual conditional expectation (ICE) plot visualizing the effect of feature variation on the model output. In general, the machine-learned model application 106 may be relatively complex and computationally intensive, such that providing real-time inference (i.e., output predictions) in a timely way primarily depends on application of the model parameters to the input. Relative to many server-based applications, an ML model application 106 typically has relatively few database or other external functional calls, such that it may process existing data for a relatively long period of time before requiring communication with, or awaiting a reply from, another system. As an example in the present disclosure, the ML application receives an input from the client device 120, and executing the ML model application 106 on that input is expected to require approximately 65-75 milliseconds (ms) of processing by a processing unit.
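For context, an individual conditional expectation (ICE) curve varies one input feature over a grid of values while holding the remaining features of an observation fixed and records the model's prediction at each grid value. The sketch below is a minimal illustration of that idea; it assumes a generic model object exposing a predict() method and is not an interface defined by this disclosure.

```python
import numpy as np


def ice_curves(model, X, feature_index, grid):
    """Compute ICE curves: for each row of X, vary one feature over `grid`
    while holding the other features fixed, and record the model's prediction.

    Returns an array of shape (n_rows, len(grid)); plotting one row against
    `grid` visualizes how the prediction responds to that feature.
    """
    curves = []
    for row in np.asarray(X, dtype=float):
        varied = np.tile(row, (len(grid), 1))      # repeat the observation
        varied[:, feature_index] = grid            # sweep the chosen feature
        curves.append(model.predict(varied))       # assumes a predict() method
    return np.asarray(curves)
```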
Although the client device 120 and ML server system 100 are shown connected through a network 110 in
The network 110 provides a communication channel between the client device 120 and ML server system 100. The network 110 may include wired and wireless communication channels and protocols, and in typical embodiments may implement network addressing and routing with any suitable technologies, such as network layer addressing with Internet Protocol (IP) addresses and transport layer protocols such as Transmission Control Protocol (TCP) or User Datagram Protocol (UDP). In general, the client device 120 sends a request to the ML server system 100 by specifying a destination network address, port, and transport protocol and including the network address and local port of the client device 120. The ML server system 100 may then establish a connection to the client device 120 as a socket defined by the respective addresses, ports, and protocol, through which data is sent and received between the client device 120 and the ML server system 100.
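As a simplified illustration of this request path, the Python sketch below opens a TCP connection to a hypothetical server address and port and sends a small HTTP request; the address, port, and /predict endpoint are placeholders and not values specified by this disclosure.

```python
import socket

SERVER_ADDR = ("192.0.2.10", 8080)  # hypothetical server network address and port

# The client's operating system assigns a local (ephemeral) port; the two
# address/port pairs plus the transport protocol (TCP here) define the socket.
with socket.create_connection(SERVER_ADDR, timeout=5.0) as sock:
    sock.sendall(
        b"POST /predict HTTP/1.1\r\n"
        b"Host: 192.0.2.10\r\n"
        b"Content-Length: 2\r\n"
        b"\r\n"
        b"{}"
    )
    response = sock.recv(4096)  # read (part of) the server's reply
    print(response.decode(errors="replace"))
```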
The ML server system 100 includes a plurality of processing units 108 that are managed by an operating system 104. In various embodiments, the processing units 108 may include processors that primarily execute sequenced instructions, such as a central processing unit (CPU), and may include processors specialized for distributed or matrix operations, such as a graphics processing unit (GPU). In the examples and experiments discussed below, the ML server system 100 includes sixty-four (64) CPUs as processing units 108 that may each be assigned a process to execute. In additional embodiments, the number and type of processing units 108 may differ.
Each processing unit 108 is generally configured to execute a process described by a set of instructions and operating on a set of data registers. Each processing unit 108 may also include various cache levels that describe proximity of the relevant data to the data registers operated on by the processing unit. For example, a “level 1” cache is typically closer to the processing unit than a “level 2” cache but is typically smaller in size. While processing the instructions, relevant data may be fetched and located at particular local data caches based on the frequency of data access or likelihood that an instruction will reference that data.
The operating system 104 manages the execution of a number of active processes by the processing units 108. Each process may be identified by a process identifier, and the operating system 104 may maintain a set of data related to execution of the process, termed a process control block. The process control block may specify various types of information about the associated process that differs in various embodiments. This information may include a state of the process (running, waiting, terminated), register data (for waiting processes), memory allocation data, scheduling data (priority information), and so forth. The process control block may also maintain information about the amount of time a process has run on a processing unit 108 and the amount of time that a process has been paused (e.g., stalled) from running, that is, the amount of time that a process has been eligible to run but has not actually been scheduled to run. In particular, the process control block may also include a list of eligible processing units permitted to run each process. Typically, the operating system 104 designates all processes as eligible to run on all processing units. For typical server applications that may require relatively limited processing, execution by any processing unit 108 enables swift resolution of the client request, or a limited processing time is sufficient to reach a waiting point at which the process is blocked while waiting for a call to another function or system. As discussed further below, this behavior can lead, for ML applications, to inefficient allocation of computing resources that significantly reduces throughput relative to the expected throughput of the application.
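For example, on a Linux-based ML server system 100, the list of eligible processing units maintained for a process can be inspected from Python as sketched below; this assumes a Linux host, where os.sched_getaffinity is available, and illustrates that by default a process is typically eligible to run on every processing unit.

```python
import os

# PID 0 refers to the calling process; os.sched_getaffinity is Linux-specific.
eligible = os.sched_getaffinity(0)
print(
    f"Process {os.getpid()} is eligible to run on {len(eligible)} of "
    f"{os.cpu_count()} processing units: {sorted(eligible)}"
)
```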
The operating system 104 uses the information from the process control block to schedule execution of the active processes at the processing units 108. The particular assignment of an active process to a particular processing unit varies in different operating systems 104. In general, the operating system 104 may only assign the process to one of the eligible processing units specified in the process control block. The operating system 104 may also change which process is executing based on various priorities, such as the priority level of a process, how long a process has been waiting to execute, and how long a process has been continuously executing, and to prevent processes from starving (i.e., "active" processes that receive zero processing time for an excessive amount of time).
The machine-learning application manager 102 initiates and manages execution of the machine-learning model application 106. In one embodiment, the ML application manager 102 is a script that may be initiated by the ML server system 100 on startup or when the ML server system 100 is signaled to receive user requests related to ML model application 106. The ML application manager 102 may instruct the operating system 104 to begin executing the machine-learning model application 106. In execution, the ML model application 106 may initiate additional processes for processing client requests, for example by requesting to fork the process from the operating system 104. The initial process started for the ML model application 106 may be referred to as a parent process, and additional process(es) may be referred to as worker process(es). The operating system 104 may then have a number of active processes associated with the ML model application 106. The ML application manager 102 instructs the operating system 104 to modify the eligible processing units for processes of the ML model application 106 as further discussed below.
When the parent process 220 is executed, it initializes a socket 250 for use by the ML application. The initialization process may differ in various embodiments and different operating systems 210, and may include creating a socket with the operating system, binding the socket to an address and/or port, and indicating that the socket is a listening/passive socket to receive incoming connection requests. Individual user requests are typically handled by a number of worker processes 230 that are initialized by the parent process 220 forking a worker process 255, whereby the operating system 210 initiates 260 a new child process that is configured to operate as a worker process 230. The worker process 230 may then connect 265 to the socket to accept incoming connection requests. When the worker process 230 is initiated with the socket created by the parent process 220, the worker process may inherit the same socket characteristics, enabling multiple worker processes 230 to receive requests from the same socket. Although not shown in
In this example, the parent process 220 creates the worker processes 230 in advance of the receipt of a user connection (i.e., pre-forking the worker processes 230); in other embodiments, the parent process 220 may accept incoming connection requests on the socket and fork worker processes 230 (e.g., up to a maximum number of worker processes) as the requests are received.
When a request is received, the requested connection 265 is completed and the worker process 230 is unblocked and may process the request using the ML model. The unblocked process may then be executed on the eligible processing units. In some embodiments, after completing processing of a request, the worker process 230 connects 265 to receive another connection to a device.
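A minimal sketch of the pre-fork pattern described above is shown below, assuming a Linux-like system in which a forked child inherits the parent's listening socket; the port, worker count, and placeholder request handling are illustrative only, and a production server framework would add payload parsing, error handling, and worker supervision.

```python
import os
import socket

NUM_WORKERS = 4  # illustrative; may be chosen to match the number of processing units

# Parent process: create, bind, and listen on a passive socket.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen(128)

worker_pids = []
for _ in range(NUM_WORKERS):
    pid = os.fork()
    if pid == 0:
        # Child (worker) process: the listening socket is inherited, so every
        # worker can accept connections arriving on the same socket.
        while True:
            conn, _addr = listener.accept()   # blocks until a request arrives
            with conn:
                _request = conn.recv(65536)   # placeholder for payload parsing
                result = b"{}"                # placeholder for ML model inference
                conn.sendall(result)
    worker_pids.append(pid)

# Parent process: wait on the workers (a real manager would monitor and restart them).
for pid in worker_pids:
    os.waitpid(pid, 0)
```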
In one or more embodiments, the ML application manager 200 assigns eligible processing units to the worker processes 230. In some embodiments, the ML application manager 200 may obtain 270 the process IDs from the operating system 210 for the worker processes 230 and/or the parent process 220 to specify the process IDs for which the eligible processing units are assigned 275. The eligible processing units may be specified to the operating system 210 in various ways depending on the operating system 210, which may include calling the “taskset” function of the operating system 210. The operating system 210 may then apply the assigned processing units to the applicable worker processes 230.
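As one concrete possibility on a Linux system, the eligible processing units could be assigned by invoking the taskset utility for each worker's process ID, as sketched below; the process IDs are placeholders, and this is offered as an illustration of one way the assignment 275 might be issued rather than a required implementation.

```python
import subprocess

worker_pids = [1201, 1202, 1203, 1204]  # placeholder process IDs of worker processes

for cpu, pid in enumerate(worker_pids):
    # "taskset -cp <cpu-list> <pid>" restricts a running process to the listed CPUs;
    # here each worker is restricted to a single, distinct CPU.
    subprocess.run(["taskset", "-cp", str(cpu), str(pid)], check=True)
```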
The ML application manager 200 specifies a subset of the computing units for each worker process as the eligible computing units on which the worker process may be executed. In one embodiment, the eligible computing units are mutually exclusive subsets of the plurality of computing units on the system. In addition, in some embodiments, a single computing unit is assigned as the eligible computing unit for each worker process 230, such that each worker process 230 may execute only on its specified eligible computing unit. For example, a first worker process may be assigned a first computing unit as its eligible computing unit, a second worker process may be assigned a second computing unit, and so forth. In this example, the number of worker processes may be the same as the number of computing units, such that there is a one-to-one correspondence between each worker process and each computing unit. As such, when the operating system schedules execution of the various active processes, the worker processes 230 may be prevented from contending with one another for a particular computing unit and may also be prevented from migrating from one computing unit to another. As shown in the experimental examples below, when a large number of computing units is available for executing the worker processes, the ability to migrate worker processes among a large number of computing units, typically advantageous for online servers, becomes a hindrance to effective throughput.
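Equivalently, on Linux the same mutually exclusive, one-to-one assignment may be expressed directly through the sched_setaffinity system call, exposed in Python as os.sched_setaffinity, rather than by invoking taskset; the process IDs below are again placeholders.

```python
import os

worker_pids = [1201, 1202, 1203, 1204]  # placeholder worker process IDs

# Assign each worker process exactly one eligible computing unit, with no two
# workers sharing a unit (a mutually exclusive, one-to-one assignment).
for cpu, pid in enumerate(worker_pids):
    os.sched_setaffinity(pid, {cpu})

# Confirm the assignment for the first worker; expected output: {0}
print(os.sched_getaffinity(worker_pids[0]))
```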
In the example of
In the example of
The application running with default computing unit eligibility breaches the 1000 ms p99 response time 500 at just 17 requests per second, as shown by a first line 510. In stark contrast, with a one-to-one assignment of computing units to worker processes, the application does not breach the 1000 ms response time until between 650 and 700 requests per second, as shown by a second line 520. This means that the same application running on the same system can realize a 38× (650/17) higher throughput with this optimization.
The task-clock performance counter represents the total CPU time (ms) that the PID utilized while in the running state during its lifetime. In this case, processing all 1,000 requests with the assigned worker processes required 62% less CPU time under the new method.
The context-switches performance counter represents how many times a PID was swapped out of the running state by the process scheduler during its lifetime. In this case, there were 95% fewer context switches under the new method. This may be a significant driver of performance degradation in the standard method because context switches are “pure overhead”: computing unit cycles are spent saving and loading a PID's context variables and swapping it to and from states (running, waiting, ready) rather than doing useful work for the application.
The cpu-migrations performance counter represents the total number of times a PID was migrated to a different CPU during a context switch throughout its lifetime. It is not surprising that the new method shows 0 migrations because each PID is eligible to run only on its own computing unit. However, migrations can also be another significant driver of performance degradation in the standard method: when the process scheduler swaps a PID from the ready state back to the running state on a new computing unit, that computing unit generally does not have the PID's instructions and data in its local caches (L0-L1). In this case there is a cache miss, and the data needs to be read from higher-level caches (L2-L4) or, in the worst case, from disk. These cache misses add significant latency. With the new method, the data and instructions are far more likely to remain in the local caches, and this locality speeds up the runtime.
The page-faults performance counter indicates when the data required by the process is not in main memory (L4) and has been paged out to disk by the virtual memory (VM) system earlier and needs to be paged back in. Paging in from disk is significantly slower than fetching from memory and hence adds latency. The new method experiences 46% fewer page faults.
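The four counters discussed above correspond to standard events reported by the Linux perf tool. The sketch below shows one way such counters might be collected for a worker process from Python, assuming perf is installed; the process ID and measurement window are placeholders, and this is not asserted to be the exact procedure used in the experiments.

```python
import subprocess

WORKER_PID = 1201        # placeholder worker process ID
MEASURE_SECONDS = "30"   # placeholder measurement window

# "perf stat -e <events> -p <pid> -- sleep <seconds>" attaches to the process
# and prints the counter totals when the sleep command finishes.
events = "task-clock,context-switches,cpu-migrations,page-faults"
subprocess.run(
    ["perf", "stat", "-e", events, "-p", str(WORKER_PID), "--", "sleep", MEASURE_SECONDS],
    check=True,
)
```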
Accordingly, assigning computing units to worker processes is a system-level configuration that is cheap to implement and provides a surprising improvement over implementations that allow workers to execute on any computing unit in a multi-computing-unit environment. As shown in the specific experiments above, this optimization was able to increase throughput for our application by 38× over the baseline on the same amount of resources, or alternatively to match the throughput of the baseline using only 20% of the resources required by the baseline application.
The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
Embodiments of the invention may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
This application claims the benefit of provisional U.S. Application No. 63/541,963, filed Oct. 2, 2023, the contents of which is incorporated herein by reference in its entirety.