The project leading to this application has received funding from the Clean Sky 2 Joint Undertaking under the European Union's Horizon 2020 research and innovation programme under grant agreement No. 945535.
Real-time processing in dynamic environments requires processing large amounts of data in very short timeframes. Depending on the particular context, such processing may involve computing iterative mathematical calculations or performing intensive data analysis. Fast and accurate data output is important for avoiding processing delays, which is especially imperative for safety-critical or mission-critical applications, such as those used in avionics.
Some real-time operating systems utilize a time and/or space partitioning process for processing data. Initially, tasks are executed at a main processor (referred to herein as a “central processing unit” or “CPU”) according to instructions from an application. The CPU is generally responsible for directing the execution of tasks along with managing data output as the CPU executes the task. Much of the raw data processing for the tasks received at the CPU is performed by a coprocessor distinct from the CPU. When the CPU executes tasks, it can assign workloads associated with the task to the coprocessor for processing. A “workload” is also referred to herein as a “job,” “kernel,” or “shader” for specific applications. A task executed by the CPU may require processing that could be more quickly executed on a coprocessor, so the CPU can send one or more requests that define the workloads that the coprocessor must execute to complete the task executed by the CPU. These requests are referred to herein as “workload launch requests.”
A coprocessor typically receives many such requests, sometimes over a short period of time. Each request may involve a very large number of intensive calculations. The ability to process workload launch requests in a timely manner depends not only on the processing capabilities of the coprocessor, but also on how the coprocessor is utilized to execute work requested by the main processor. While coprocessors with powerful processing resources can process these requests quickly, they can be expensive to implement, with no guarantee that the coprocessor is capable of processing tasks with substantial processing requirements in a short timeframe. Less advanced coprocessors with limited processing resources are prone to processing delays when they lack sufficient bandwidth to process additional requests, which may lead to a loss of determinism guarantees. In either case, the coprocessor can become overwhelmed with backed-up workload launch requests.
Some coprocessors enable time and/or space partitioning of their processing resources so that multiple jobs can be executed in parallel. However, conventional coprocessors do not provide sufficient spatial isolation, time determinism, and responsiveness to execute multiple safety-critical applications simultaneously. Failure to process safety-critical applications in a timely manner can ultimately lead to a loss of determinism guarantees.
The details of one or more embodiments are set forth in the description below. The features illustrated or described in connection with one exemplary embodiment may be combined with the features of other embodiments. Thus, any of the various embodiments described herein can be combined to provide further embodiments. Aspects of the embodiments can be modified, if necessary to employ concepts of the various patents, applications and publications as identified herein to provide yet further embodiments.
In one embodiment, a processing system is disclosed. The processing system comprises a processor and a coprocessor configured to implement a processing engine. The processing system further comprises a processing engine scheduler configured to schedule workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission to the coprocessor based on a coprocessor scheduling policy. Based on the coprocessor scheduling policy, the processing engine scheduler selects which coprocessor clusters are activated to execute workloads identified by a queue based on the at least one launch request. The coprocessor scheduling policy defines at least one of: tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to immediately execute on the coprocessor within time of a timing window in which the one or more tasks are being executed on the processor, or tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute on the coprocessor based on an order of priority and either: with respect to an external event common to both the processor and coprocessor, or during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor.
In another embodiment, a coprocessor is disclosed. The coprocessor is configured to be coupled to a processor and configured to implement a processing engine. The coprocessor comprises at least one cluster configured to execute workloads. The coprocessor comprises a processing engine scheduler configured to schedule workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission based on a coprocessor scheduling policy. Based on the coprocessor scheduling policy, the processing engine scheduler is configured to select which of the at least one cluster is activated to execute workloads identified by a queue comprising the at least one launch request. The coprocessor scheduling policy defines at least one of: tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to immediately execute on the coprocessor within time of a timing window in which the one or more tasks are being executed on the processor, or tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute on the coprocessor based on an order of priority and either: with respect to an external event common to both the processor and coprocessor, or during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor.
In another embodiment, a method is disclosed. The method comprises receiving one or more workload launch requests from one or more tasks executing on a processor. The one or more workload launch requests include one or more workloads configured for execution on a coprocessor. The method comprises generating at least one launch request in response to the one or more workload launch requests based on a coprocessor scheduling policy. The method comprises scheduling one or more workloads identified in the at least one launch request for execution on the coprocessor based on the coprocessor scheduling policy by at least one of: a tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to immediately execute on the coprocessor within time of a timing window in which the one or more tasks are being executed on the processor, or a tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute on the coprocessor based on an order of priority and either: with respect to an external event common to both the processor and coprocessor, or during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor.
In another embodiment, a processing system is disclosed. The processing system comprises a processor and a coprocessor configured to implement a processing engine. The processing system comprises a processing engine scheduler configured to schedule workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing or executed on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission to the coprocessor. The coprocessor comprises a plurality of compute units and at least one command streamer associated with one or more of the plurality of compute units. Based on a coprocessor assignment policy, the processing engine scheduler is configured to assign for a given execution partition, via the at least one command streamer, clusters of compute units of the coprocessor to execute one or more workloads identified by the one or more workload launch requests as a function of workload priority. The coprocessor assignment policy defines at least: an exclusive assignment policy wherein each workload is executed by a dedicated cluster of compute units; an interleaved assignment policy wherein each workload is exclusively executed across all compute units of the clusters of compute units; a policy-distributed assignment policy wherein each workload is individually assigned to at least one cluster of the clusters of compute units and an execution duration during the given execution partition; or a shared assignment policy wherein each workload is non-exclusively executed by the clusters of compute units each concurrently executing multiple workloads.
In another embodiment, a coprocessor is disclosed. The coprocessor is configured to be coupled to a processor and configured to implement a processing engine. The coprocessor comprises a plurality of compute units each configured to execute workloads. The coprocessor comprises a processing engine scheduler configured to assign workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing or executed on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission to the coprocessor. The coprocessor comprises at least one command streamer associated with one or more of the plurality of compute units. Based on a coprocessor assignment policy, the processing engine scheduler is configured to assign for a given execution partition, via the at least one command streamer, clusters of compute units of the coprocessor to execute one or more workloads identified by the one or more workload launch requests as a function of workload priority. The coprocessor assignment policy defines at least: an exclusive policy wherein each workload is executed by a dedicated cluster of the clusters of compute units; an interleaved policy wherein each workload is exclusively executed across all compute units of at least one cluster of the clusters of compute units; a policy-distributed policy wherein each workload is individually assigned to at least one cluster of the clusters of compute units and an execution duration during the given execution partition; a shared policy wherein each workload is non-exclusively executed by the clusters of compute units each concurrently executing multiple workloads.
In another embodiment, a method is disclosed. The method comprises receiving one or more workload launch requests from one or more tasks executing or executed on a processor. The one or more workload launch requests include one or more workloads configured for execution on a coprocessor. The method comprises generating at least one launch request in response to the one or more workload launch requests. The method comprises assigning clusters of compute units of the coprocessor to execute one or more workloads identified in the one or more workload launch requests as a function of workload priority based on a coprocessor assignment policy. The coprocessor assignment policy defines at least: an exclusive policy wherein each workload is executed by a dedicated cluster of the clusters of compute units; an interleaved policy wherein each workload is exclusively executed across all compute units of at least one cluster of the clusters of compute units; a policy-distributed policy wherein each workload is individually assigned to at least one cluster of the clusters of compute units and an execution duration during a given execution partition; a shared policy wherein each workload is non-exclusively executed by the clusters of compute units each concurrently executing multiple workloads.
Understanding that the drawings depict only exemplary embodiments and are not therefore to be considered limiting in scope, the exemplary embodiments will be described with additional specificity and detail through the use of the accompanying drawings, in which:
In accordance with common practice, the various described features are not drawn to scale but are drawn to emphasize specific features relevant to the exemplary embodiments.
In the following detailed description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific illustrative embodiments. However, it is to be understood that other embodiments may be utilized and that logical, mechanical, and electrical changes may be made. Furthermore, the method presented in the drawing figures and the specification is not to be construed as limiting the order in which the individual steps may be performed. The following detailed description is, therefore, not to be taken in a limiting sense.
Embodiments of the present disclosure provide improvements to the scheduling and assignment of workload(s) to a coprocessor (for example, a GPU) for execution. Some embodiments disclosed herein enable workload(s) to be scheduled to a GPU based on a timing window of the CPU so that the GPU is at least partially synchronized with the CPU. Other embodiments enable a GPU to dynamically assign workload(s) to optimize use of the processing resources on the GPU. Workloads may be pedagogically referred to herein in the singular “workload” or the plural “workloads,” with the understanding that the description applies to either a single workload or multiple workloads unless otherwise stated.
While some examples are illustrated and described for specifically scheduling and assigning workloads to a GPU, the examples described herein are also applicable in the context of other systems. For example, such techniques are also applicable to any processing system having one or more processors that schedule and assign workloads to one or more coprocessors. The coprocessor can generally be implemented as an integrated or discrete processing unit in a processing system. In various examples, the coprocessor can be implemented as a graphics processing unit (“GPU”), a neural processing unit (“NPU”), a data processing unit (“DPU”), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), other processing circuitry, or a combination thereof.
The coprocessor may accelerate workload processing using traditional execution or artificial intelligence-facilitated execution. For AI-based modeling, the coprocessor is used to accelerate execution of some of the workloads associated with a machine learning (ML)/artificial intelligence (AI) application. Additionally, the coprocessor can be used to accelerate execution of an ML/AI application with an inference engine, which can be utilized for deep neural network (DNN) processing, for example. In the various figures and the description that follow, the coprocessor is implemented as a GPU for pedagogical explanation.
Processor 104 is configured to receive system parameters from an offline system 102 (for example, from a stored system configuration), including a coprocessor scheduling policy that determines when workloads are assigned to the coprocessor 106 and a coprocessor assignment policy that determines where workloads are assigned to processing resources of the coprocessor 106. Processor 104 is also configured to execute tasks 105 received from one or more applications (safety-critical applications, best-effort applications, etc.) running on processing resources (processors, processing circuitry) of processor 104 (not shown in
Processor 104 may include one or more partitions 103. Each partition 103 functions as an independent processing system (for example, a processing core 103 as shown in
Each coprocessor 106 coupled to one or more processors 104 is configured to receive at least some of the processing offloaded by the processors 104. System 100 includes a driver 108 including a processing engine scheduler (also referred to as “scheduler”) 109 and one or more contexts 110. The context 110 includes hardware configured to provide spatial isolation. Multiple contexts 110 enable execution of multiple partitions on the coprocessor in parallel to support time and/or space partitioning.
For artificial intelligence processing models such as a neural network, the scheduler 109 can be an inference engine scheduler that utilizes inference processing to schedule workloads for execution on the coprocessor 106. The driver 108 and coprocessor 106 can utilize multiple types of processing, including computing and rendering. In one example where system 100 is an RTOS, the driver 108 resides in the processor 104 and schedules workloads for execution based on the processing resources of the coprocessor 106. In another example, the driver 108 is implemented by software that is exclusively accessible by a server application to which one or multiple client applications are submitting workloads. The server generally retains exclusive access to the driver 108 and utilizes the driver 108 to schedule workloads on the coprocessor 106 when it receives workload launch requests from tasks 105 executed on the processor 104. As shown in
The scheduler 109 of the driver 108 is configured to dispatch workloads associated with the tasks 105 executed by processor 104 to compute units 115, 117, and in some examples, dispatches workloads based on a timing window of the processor 104. Scheduler 109 is configured to receive the workload launch requests from processor 104 and to schedule workloads for execution on the coprocessor 106. In some examples, scheduler 109 is configured to generate at least one launch request from the workload launch requests based on a scheduling policy. Some examples of scheduling policies are described further with respect to
For each launch request generated by scheduler 109, one or more contexts 110 include the workloads that will be scheduled and assigned to the processing resources of the coprocessor for execution. Context 110 also includes one or more queues 111 that categorize the workloads identified from the one or more launch requests. The launch requests in each queue 111 can be queued and scheduled or assigned in sequence based on the priority of the queue 111 relative to other queues organized by context 110. In some examples, the queues 111 are stored in a run-list that lists the priority of each queue. Also, driver 108 can include any number of contexts 110, and each context 110 can include any number of queues 111. In some examples, workload launch requests in different queues can be executed in parallel or in different orders provided that workloads in the queues are isolated from each other during processing.
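Purely as a non-limiting illustration of the organization described above, the following C++ sketch models workloads, queues, contexts, and a priority-ordered run-list; the type and function names (Workload, Queue, Context, buildRunList) are hypothetical and are not taken from any actual driver implementation.

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Hypothetical driver-side bookkeeping for launch requests (illustrative only).
    struct Workload {
        std::string id;
        int priority;                      // higher value = higher priority
    };

    struct Queue {
        int priority;                      // priority of this queue relative to other queues
        std::vector<Workload> workloads;   // workloads identified by launch requests
    };

    struct Context {
        std::vector<Queue> queues;         // queues isolated from other contexts
    };

    // Build a run-list that orders every queue across all contexts by priority,
    // so the scheduler can dispatch the highest-priority queue first.
    std::vector<const Queue*> buildRunList(const std::vector<Context>& contexts) {
        std::vector<const Queue*> runList;
        for (const Context& ctx : contexts)
            for (const Queue& q : ctx.queues)
                runList.push_back(&q);
        std::sort(runList.begin(), runList.end(),
                  [](const Queue* a, const Queue* b) { return a->priority > b->priority; });
        return runList;
    }

    int main() {
        Context ctx0;
        ctx0.queues.push_back({1, {{"W1", 3}, {"W2", 1}}});
        Context ctx1;
        ctx1.queues.push_back({5, {{"W3", 2}}});
        std::vector<Context> contexts = {ctx0, ctx1};
        for (const Queue* q : buildRunList(contexts))
            std::printf("queue priority %d, %zu workload(s)\n",
                        q->priority, q->workloads.size());
        return 0;
    }

In this simplified model, the run-list simply orders every queue across all contexts by its priority, which mirrors the idea that the highest-priority queue is scheduled or assigned first.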
Coprocessor 106 further includes one or more command streamers 112 configured to schedule and assign the workload(s) identified by the launch requests according to the coprocessor scheduling policy and coprocessor assignment policy to available clusters 114 and/or 116. Coprocessor 106 can include any number of command streamers 112, and in some examples, one or more command streamers 112 are shared between queues 111 and/or hosted by a dedicated context 110. Each cluster 114, 116 includes a set of respective compute units 115, 117 configured to perform data processing. In some examples, the clusters 114, 116 are statically configured (e.g., hardwired) in the coprocessor 106 in which the compute units 115 are permanently associated with cluster 114 and compute units 117 are permanently associated with cluster 116. Clusters 114, 116 are configured to execute processing associated with one or more of the workloads associated with each queue 111 when a queue is assigned to the respective cluster by command streamer 112.
A “compute unit” as used herein refers to a processing resource of a cluster. Each compute unit 115, 117, can comprise one processing core (otherwise referred to as a “single-core processing unit”) or multiple processing cores (otherwise referred to as a “multi-core processing unit”) as presented to scheduler 109 for executing workloads. Cores can be either physical or virtual cores. Physical cores include hardware (for example, processing circuitry) forming the core that physically processes an assigned workload. However, virtual cores can also be presented to scheduler 109 for processing workloads with each virtual core being implemented using the underlying physical cores.
Processor 104 and coprocessor 106 generally include a combination of processors, microprocessors, digital signal processors, application specific integrated circuits, field programmable gate arrays, and/or other similar variants thereof. Processor 104 and coprocessor 106 may also include, or function with, software programs, firmware, or other computer readable instructions for carrying out various process tasks, calculations, and control functions used in the methods described below. These instructions are typically tangibly embodied on any storage media (or computer readable media) used for storage of computer readable instructions or data structures.
Data from workload execution along with other information can be stored in a memory (not shown in
Referring to
In the example shown in
In contrast,
As described in further detail below, in some examples driver 108 is configured to schedule workload(s) in accordance with a coprocessor scheduling policy so that the coprocessor is at least partially synchronized with the processor. In other examples, driver 108 is configured to assign workloads to processing resources of the coprocessor in accordance with a coprocessor assignment policy to optimize use of the processing resources on the coprocessor. Both the scheduling policy and the assignment policy can include a policy governing preemption of workloads based on the priority of the workloads. Although described separately for pedagogical explanation, the workload scheduling, workload assignment, and workload preemption techniques can be utilized in combination.
Coprocessor Scheduling Policies
As previously described with respect to
Still referring to
At the next stage of the scheduling policy, scheduler 109 selects the workload(s) that will be scheduled for processing to the clusters 114, 116. Similar to the associated partition parameters described above, workload(s) can have associated parameters such as a workload ID, partition ID, priority, budget, cluster requirements, preemption, number of kernels, and other parameters. Once scheduler 109 selects the queue 111 and the workloads associated with the selected queue 111, scheduler 109 then generates the one or more launch requests associated with the selected tasks 105 based on the coupling arrangement between processor 104 and coprocessor 106. Depending on the example, the coprocessor 106 may have varying degrees of synchronization with the processor 104. In one example, coprocessor 106 is decoupled from the processor 104 and operates asynchronously to processor 104. Thus, launch requests generated by the scheduler 109 are scheduled when clusters 114, 116 become available on the coprocessor 106, according to the priority of the associated workload request. In this coupling arrangement, little to no preemption occurs on workloads already executing on the coprocessor 106.
In another example, the coprocessor 106 shares a loosely-coupled arrangement with processor 104. In this example, coprocessor 106 operates with some degree of synchronization with processor 104. For example, in a loosely-coupled arrangement, coprocessor 106 is synchronized at a data frame boundary with processor 104, and any workloads left unserviced at the end of the data frame are cleared at the start of a subsequent data frame. Accordingly, both processor 104 and coprocessor 106 will have the same input and output data rate in a loosely-coupled arrangement. However, coprocessor 106 will generally operate asynchronously to processor 104 during timing windows, meaning that partitions and/or tasks executing on processor 104 may execute in parallel with uncorrelated partitions and/or workloads executing on coprocessor 106. Loosely-coupled arrangements can support both preemptive and non-preemptive scheduling between processor 104 and coprocessor 106.
In yet another example, the coprocessor 106 shares a tightly-coupled arrangement with processor 104. In this example, coprocessor 106 operates with a high degree of synchronization with processor 104; that is, coprocessor 106 synchronizes queue and/or workload execution associated with a corresponding task concurrently executed by the processor 104 based on a timing window of the processor 104. Tightly-coupled arrangements can be embodied in various ways. In one implementation, the coprocessor 106 is highly synchronized with processor 104 during the same timing window, or in other words, coprocessor 106 executes workloads associated with one or more tasks currently executed on processor 104 in that timing window. As processor 104 executes another task in a subsequent timing window, coprocessor 106 then executes workloads associated with the next task executed by processor 104. In another implementation, the coprocessor 106 synchronizes with processor 104 for a subsequent timing interval but coprocessor 106 maintains the freedom to execute workloads associated with a different task consistent with other priority rules or processing availability on the coprocessor 106.
Coupling arrangements may also be combined. For example, coprocessor 106 can be loosely-coupled with processor 104 with respect to one timing window, but is tightly-coupled with processor 104 with respect to another timing window. Thus, a scheduling policy may schedule launch requests based on a combination of coupling arrangements between processor 104 and coprocessor 106, and can be dynamically updated as the system scheduling parameters change.
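As a minimal sketch of the three coupling arrangements discussed above, and assuming a simplified model in which the only decision is when a generated launch request may be submitted relative to the processor's timing, the following C++ fragment is illustrative only; the enum and function names do not correspond to any particular driver interface.

    #include <cstdio>
    #include <initializer_list>

    // Degree of synchronization between processor and coprocessor (illustrative).
    enum class Coupling { Decoupled, LooselyCoupled, TightlyCoupled };

    // Summarize when a generated launch request may be submitted for each arrangement.
    const char* submissionRule(Coupling c) {
        switch (c) {
        case Coupling::Decoupled:
            // Asynchronous to the processor: launch whenever clusters free up,
            // ordered only by workload priority.
            return "submit when a cluster is available, by priority";
        case Coupling::LooselyCoupled:
            // Synchronized at the data frame boundary; unserviced work is
            // cleared at the start of the next frame.
            return "submit within the current data frame; clear leftovers at the frame boundary";
        case Coupling::TightlyCoupled:
            // Synchronized to the processor timing window: execute in the same
            // window as the requesting task (or the subsequent window).
            return "submit within the requesting task's timing window (or the next one)";
        }
        return "unknown";
    }

    int main() {
        for (Coupling c : {Coupling::Decoupled, Coupling::LooselyCoupled,
                           Coupling::TightlyCoupled})
            std::printf("%s\n", submissionRule(c));
        return 0;
    }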
While cluster 1 of GPU 106 continues processing workloads from the task executed at time 302, timing window 2 (TW2) begins at time 306 at CPU 104, and CPU 104 begins processing a second task at time 306. While cluster 1 is executing workloads associated with CPU 104, cluster 2 begins executing workloads associated with the next CPU task. At time 308, cluster 1 completes processing of the workloads associated with the first CPU task and begins processing workloads associated with the second CPU task. Hence, at time 308, both clusters 1 and 2 devote processing resources to execute workloads associated with the second CPU task. In this example, the work that was previously executed only on cluster 2 has been scaled-up and now executes on both clusters 1 and 2. Then at time 310, clusters 1 and 2 finish processing the workloads associated with the second CPU task within timing window 2. Because CPU 104 has no additional tasks that require scheduling within timing window 2, clusters 1 and 2 can be allocated for processing a low-priority workload at time 310 if such a workload is available. For highest-priority workloads, driver 108 is configured to prioritize scheduling so that these workloads can begin execution within the earliest available timing window. In contrast, driver 108 is configured to schedule lowest-priority workloads whenever the processing resources become available. That is, the driver 108 makes a “best effort” to schedule low-priority workloads within the earliest available timing window, but such low-priority workloads may not be able to begin or finish execution once scheduled due to being preempted by a higher-priority workload and/or insufficient processing resources to execute the lower-priority workload. In avionics applications, a high-priority workload is associated with a high Design Assurance Level (DAL) (e.g., A-C), while a low-priority workload is associated with a low DAL (e.g., D-E).
At time 312, the timing window changes to timing window 1 and CPU 104 begins executing a third task. In some examples, the timing windows are scheduled in sequence via time division multiplexing. At time 314, GPU driver 108 receives instructions from CPU 104 to begin executing workloads associated with the third CPU task. Since the third CPU task has a higher priority than the low-priority workload being executed by clusters 1 and 2 after time 310, GPU driver 108 halts (or preempts) execution of the low-priority workload at time 314 and schedules workloads associated with the third CPU task to cluster 1 for execution. Cluster 2 optionally remains idle at time 314 as cluster 1 executes workloads. At time 316, cluster 1 finishes executing workloads associated with the third CPU task and both clusters 1 and 2 resume processing a low-priority workload. Timing windows 1 and 2 can alternate as required and may or may not be synchronized with reception of new data frames. GPU 106 can continue to process low-priority workloads as frame 1 is processed until CPU 104 executes another task that requires workload(s) on the GPU 106. The number of timing windows in a data frame can vary and in some examples are designated independently of the coprocessor scheduling policy.
At time 318, CPU 104 receives a new data frame (frame 2). Timing window 1 begins at time 320 shortly after data frame 2 is received, and CPU 104 begins executing a fourth CPU task. At time 322, GPU 106 then executes a workload to determine whether the fourth CPU task requires processing; GPU driver 108 assigns this workload to cluster 1 as shown in
At time 325, CPU 104 executes a fifth CPU task and subsequently sends a workload request for a workload associated with the fifth CPU task to GPU driver 108 at time 326. In this case, the GPU 106 and CPU 104 execute the corresponding workloads and tasks in parallel, with the CPU 104 waiting, when necessary, for the GPU 106 to “catch up.” The workload associated with the fifth CPU task preempts the low-priority workload previously executing on GPU 106. At time 328, clusters 1 and 2 finish the workload associated with the fifth CPU task and resume processing of a low-priority workload. Finally, at time 330, CPU 104 executes a sixth CPU task at timing window 1 and determines that no additional processing is required to execute the sixth CPU task. Accordingly, GPU 106 continues to process the low-priority workload for the remaining time of timing window 1 until a new data frame (frame 3) is received.
Referring first to diagram 400A, CPU 104 executes a first CPU task within timing window 1 at time 401. GPU driver 108 subsequently determines that processing is required for the first CPU task and at time 402, GPU driver 108 schedules a workload associated with the first CPU task to cluster 1 for execution. Cluster 1 continues to execute the workload for the remainder of timing window 1, but is unable to finish execution of the workload before timing window 2 begins at time 403. When timing window 2 starts, CPU 104 begins executing a second CPU task that requires processing from GPU 106. The workload processed by cluster 1 during timing window 1 becomes preempted at the timing window 2 boundary. Since in this example GPU 106 is synchronized to CPU 104 at the timing window boundary, cluster 1 halts processing of workloads associated with the first CPU task once timing window 2 begins at time 403. Meanwhile, cluster 2 begins processing of workloads associated with the second CPU task during the time of timing window 2.
At time 404, the timing window reverts to timing window 1. At this point, the workload processed by cluster 2 during timing window 2 becomes preempted, and at time 404 cluster 1 resumes processing of the workload associated with the first CPU task that had previously been preempted at the start of timing window 2. As cluster 1 resumes processing of the first CPU task workload, CPU 104 also executes a third CPU task for processing. At time 405, cluster 1 finishes processing of the workload associated with the first CPU task and at time 406 begins processing of the workload associated with the third CPU task. At time 407, processing of the third CPU task workload becomes preempted as the timing window reverts to timing window 2. At this point, cluster 2 resumes processing of the workload associated with the second CPU task.
At time 408, a new data frame (frame 2) is received. At time 409, CPU 104 executes a fourth CPU task. At time 410, GPU driver 108 schedules a light workload associated with the fourth CPU task and determines that additional processing is not required for the fourth CPU task. Therefore, GPU driver 108 schedules a low-priority workload during the remainder of timing window 1. The low-priority workload is preempted once timing window 2 begins, and CPU 104 executes a fifth CPU task. Then, at time 411, GPU driver 108 schedules a workload associated with the fifth CPU task to both clusters 1 and 2. At time 412, GPU 106 completes execution of the workload associated with the fifth CPU task and resumes processing of a low-priority workload for the remainder of timing window 2.
At time 419, the timing window changes to timing window 1 and the workloads queued from the previous timing window can now be executed by clusters 1 and/or 2 of GPU 106. However, in some examples, the queued workloads are delayed by a designated time within the current timing window. For example, queued workloads optionally include an estimated time required to complete the workload. If the estimated time is less than the duration of the current timing window, the queued workload can be delayed until the time remaining in the current timing window is equal to the estimated completion time of the queued workloads. This is illustrated in diagram 400B, as the workload associated with the second CPU task is not performed by GPU 106 at the start of timing window 1, but rather begins at time 421 after some time has elapsed. In the interim, both clusters 1 and 2 process a low-priority workload during time 419 until the remaining time in timing window 1 equals the estimated completion time of the workload associated with the second CPU task.
Also at time 419, CPU 104 begins processing a third CPU task. At time 420, CPU 104 registers a workload associated with the third CPU task, which the GPU 106 begins executing with cluster 1 at time 422 for the duration of the subsequent timing window.
A new data frame (frame 2) is subsequently received. Beginning at time 423, CPU 104 begins processing a fourth CPU task at timing window 1. Since no workloads were registered by CPU 104 at the previous timing window, GPU 106 begins processing a low-priority workload for the duration of timing window 1 beginning at time 423. Timing window 2 begins at time 424. At this time, cluster 1 of GPU 106 executes a light workload associated with the fourth CPU task while CPU 104 begins processing a fifth CPU task. At time 425, cluster 1 finishes processing the light workload associated with the fourth CPU task and resumes processing of a low-priority workload (along with cluster 2) for the duration of timing window 2. At time 426, CPU 104 executes a sixth CPU task while clusters 1 and 2 begin processing workloads associated with the fifth CPU task from the previous timing window. Once completed, clusters 1 and 2 resume processing of a low-priority workload beginning at time 427 for the duration of timing window 1.
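The delayed-start behavior described above (in which a queued workload begins only when the time remaining in the window equals its estimated completion time) can be illustrated with a small C++ sketch; the time units, values, and function name are assumptions for illustration.

    #include <cstdio>

    // Compute how long a queued workload can be delayed within the current
    // timing window so that it still finishes by the window boundary.
    // Times are in arbitrary units; the values below are illustrative only.
    double delayBeforeStart(double windowDuration, double estimatedCompletion) {
        if (estimatedCompletion >= windowDuration)
            return 0.0;                                   // no slack: start immediately
        return windowDuration - estimatedCompletion;      // run low-priority work in the slack
    }

    int main() {
        double window = 1000.0, estimate = 300.0;
        std::printf("delay queued workload by %.0f units; it then finishes at the boundary\n",
                    delayBeforeStart(window, estimate));
        return 0;
    }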
In some examples, the order of priority for a given task is based on the timing window in which it is initially scheduled. For example, for a given set of three workloads to be executed on the GPU (W1, W2, W3), W1 can have the highest priority in timing window 1 and therefore will not be preempted by W2 or W3 during timing window 1. Once timing window 2 begins, the priority can change so that W2 has the highest priority, enabling W2 to immediately schedule for execution and preempt W1 if W1 has not finished execution during timing window 1. Similarly, once the timing window switches to timing window 3, W3 then has the highest priority and may be scheduled immediately for execution and may preempt W2. Thus, as the timing window changes between timing windows 1, 2, and 3, the order of priority between the workloads assigned to the GPU can also change.
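One hypothetical way to express the window-dependent priority ordering in the preceding example is sketched below in C++. The source only states which workload has the highest priority in each window, so the full rotation used for the remaining workloads is an assumption made for illustration.

    #include <cstdio>

    // Effective priority of workload w (0-based) in timing window tw (0-based),
    // for n workloads: the workload matching the current window gets the highest
    // priority, and the others follow in rotating order (assumed rule).
    int effectivePriority(int w, int tw, int n) {
        int distance = ((w - tw) % n + n) % n;   // 0 = the current window's workload
        return n - distance;                     // n = highest priority, 1 = lowest
    }

    int main() {
        const int n = 3;   // W1, W2, W3
        for (int tw = 0; tw < n; ++tw) {
            std::printf("timing window %d:", tw + 1);
            for (int w = 0; w < n; ++w)
                std::printf("  W%d=%d", w + 1, effectivePriority(w, tw, n));
            std::printf("\n");
        }
        return 0;
    }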
Referring to
In various examples, some CPU cores may execute tasks based on whether a corresponding GPU cluster is currently executing workloads. For example, consider CPU 1. As shown in
In additional or alternative examples, some CPU-GPU synchronization configurations are not “blocked.” In these examples, the CPU is configured to execute tasks independently of whether a corresponding GPU cluster is currently executing workloads associated with another task. As shown in
In some examples, the CPU-GPU synchronization is partially “blocked” so that the CPU is free to execute tasks simultaneously with a corresponding GPU until the GPU becomes too backlogged with workloads from the CPU. In that case, the CPU may wait until the GPU finishes a certain number of workloads to “catch up” with the CPU. For example, CPU 2 may wait a certain time period until cluster 2 finishes execution of workloads 508 before executing task 512.
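A small C++ sketch of the partially “blocked” arrangement just described is given below, under the assumption that the processor tracks the coprocessor backlog as a simple count of outstanding workloads and waits only when that count exceeds a threshold; the names and the threshold value are hypothetical.

    #include <cstdio>

    // Illustrative model: the CPU submits one workload per task and only blocks
    // when the GPU backlog exceeds a threshold (partially "blocked" coupling).
    struct GpuBacklog {
        int pending = 0;       // workloads submitted but not yet finished
        int maxPending = 2;    // assumed backlog threshold

        bool canAcceptMore() const { return pending < maxPending; }
        void submit() { ++pending; }
        void retire() { if (pending > 0) --pending; }
    };

    int main() {
        GpuBacklog gpu;
        for (int task = 1; task <= 5; ++task) {
            while (!gpu.canAcceptMore()) {
                // CPU waits for the GPU to "catch up" before executing the next task.
                std::printf("task %d waits: backlog=%d\n", task, gpu.pending);
                gpu.retire();   // stand-in for a GPU workload completing
            }
            gpu.submit();
            std::printf("task %d executes; workload submitted (backlog=%d)\n",
                        task, gpu.pending);
        }
        return 0;
    }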
Now referring to
Examples of the coprocessor scheduling policy (and the coprocessor assignment policy described further herein) optionally include a preemption policy which governs the preemption of workloads configured for execution by the coprocessor (for example, a GPU). Examples of preemption scheduling are illustrated in
After CPU 2 finishes high-priority task 604, cluster 1 may execute workloads 608 associated with the high-priority task 604. However, at the time CPU 2 finishes high-priority task 604, cluster 1 is already executing workloads 606 associated with the low-priority task 602 executed on CPU 1. In a non-preemptive example as shown in
Conversely,
For examples that implement workload preemption on the coprocessor such as a GPU, the coprocessor receives a request, for example, from a driver that specifies when the preemption of a lower-priority workload will occur. A preemption policy can be implemented through hardware and/or software. In one hardware example, preemption occurs at the command boundary such that lower-priority workloads (or contexts including a set of lower-priority workloads) are preempted once the command is completed, or at the earliest preemptable command (that is, when the GPU can implement the next command). In another hardware example, preemption occurs at the thread boundary, where a lower-priority context stops issuing additional lower-priority workloads and becomes preempted when all workloads being currently executed are finished. In yet another hardware example, workload execution is preempted by saving the workload state into memory, which can be restored once execution is resumed. In another hardware example, preemption occurs during execution of a thread, in which the GPU can immediately stop execution of a lower-priority thread and store the previously executed part of the thread into memory for later execution.
A coprocessor may also implement preemption through software. In one software example, preemption occurs at the thread boundary as previously described in hardware implementations. In another software example, preemption occurs immediately upon receiving the request and any current or previously executed workloads within the same context must be restarted at a later time period, analogous to resetting a lower-priority workload as referenced in
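Purely for illustration, the preemption granularities described above can be summarized with a hypothetical enumeration and handler in C++; none of these names correspond to an actual GPU or driver interface.

    #include <cstdio>

    // Points at which a lower-priority workload may be preempted (illustrative).
    enum class PreemptionPoint {
        CommandBoundary,   // finish the current (or earliest preemptable) command
        ThreadBoundary,    // stop issuing threads; drain those already running
        SaveState,         // save workload state to memory and restore it later
        ImmediateRestart   // stop now; restart the affected workloads later
    };

    const char* onPreemptRequest(PreemptionPoint p) {
        switch (p) {
        case PreemptionPoint::CommandBoundary: return "preempt after the current command completes";
        case PreemptionPoint::ThreadBoundary:  return "preempt once in-flight threads finish";
        case PreemptionPoint::SaveState:       return "save execution state, resume later";
        case PreemptionPoint::ImmediateRestart:return "stop immediately, restart workload later";
        }
        return "unknown";
    }

    int main() {
        std::printf("%s\n", onPreemptRequest(PreemptionPoint::ThreadBoundary));
        return 0;
    }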
Method 700 includes block 702 of receiving workload launch requests from one or more tasks executing on a processor, such as by a driver implemented on a coprocessor or other processing unit. The workload launch requests include a list of the workloads associated with a task executed by the processor, and may include other parameters such as the priority of workloads in the list and the processing resources required to execute the respective workloads on the coprocessor. At block 704, method 700 proceeds by generating at least one launch request from the workload launch requests based on a coprocessor scheduling policy. The driver or other processing unit can then schedule workloads for execution on the coprocessor based on the launch requests and the coprocessor scheduling policy (block 705).
Depending on the example, method 700 proceeds based on the terms of the coprocessor scheduling policy. Optionally, method 700 proceeds to block 706 and schedules workloads for execution independent of a time period (e.g., a timing window and/or data frame boundary) of the processor or other external events. In this loosely-coupled configuration, the coprocessor can schedule workloads asynchronously to the timing of the processor. Such loosely-coupled configurations optionally enable workload scheduling based on an order of priority between the workloads received by the coprocessor. For example, even though the coprocessor may schedule workloads asynchronously to the processor timing windows, the coprocessor scheduling policy may include a preemption policy that preempts lower priority workloads currently executed or queued on the coprocessor with higher priority workloads.
Additionally, or alternatively, method 700 optionally proceeds to block 708 and schedules workloads for execution based on a timing window of the processor. In one implementation, method 700 schedules workloads for execution on the coprocessor during the same timing window of the processor. In another implementation, method 700 schedules workloads for execution on the coprocessor in the same timing window, wherein the coprocessor 106 synchronizes with processor 104 in that timing window but maintains the freedom to execute workloads associated with a different queue and/or task consistent with other priority rules or processing availability on the coprocessor. That is, a coprocessor scheduling policy optionally includes a preemption policy that applies to tightly-coupled configurations and which schedules workloads for execution based on an order of priority of workloads. When a workload launch request includes workloads with higher priority than workloads currently executed on the coprocessor, the coprocessor scheduling policy configures the coprocessor to preempt the lower priority workloads and synchronize the higher priority workloads to the subsequent timing window of the processor or to another event common to both the coprocessor and processor.
As previously described with respect to
In some examples, GPU jobs are assigned to clusters using an exclusive policy where workloads associated with different CPU tasks are assigned exclusively to different clusters for one or more timing intervals. Referring to
In contrast to an exclusive assignment policy, an interleaved assignment policy assigns workloads associated with the same CPU task simultaneously to multiple clusters of the GPU. As shown in
Both the exclusive and interleaved assignment policies correspond to a static assignment policy that assigns workloads to clusters/partitions independent of workload priority or computing resources. Conversely, a policy-distributed assignment policy exemplifies a dynamic assignment policy that considers workload priority and the computing resources of a cluster/partition. A workload associated with a processor task that is higher in priority than another workload associated with another processor task will generally be assigned before the lower priority workload and will generally be assigned to more available clusters than the low-priority workload. The number of clusters or partitions that the workload is assigned to depends on the amount of resources necessary to process the workload and/or the amount of computing resources currently available in the coprocessor.
In the example depicted in
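A minimal C++ sketch of a policy-distributed assignment as characterized above is shown below: workloads are ordered by priority, and each is granted clusters up to its request, bounded by what remains available. The structures and the greedy rule are simplifying assumptions for illustration, not the disclosed assignment algorithm.

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    struct Workload {
        std::string id;
        int priority;            // higher value = assigned first
        int clustersRequested;   // resources needed to process the workload
    };

    // Greedy policy-distributed assignment: highest priority first, each workload
    // receives up to its requested number of clusters from what is still free.
    std::vector<std::pair<std::string, int>>
    assignClusters(std::vector<Workload> workloads, int freeClusters) {
        std::sort(workloads.begin(), workloads.end(),
                  [](const Workload& a, const Workload& b) { return a.priority > b.priority; });
        std::vector<std::pair<std::string, int>> grants;
        for (const Workload& w : workloads) {
            int granted = std::min(w.clustersRequested, freeClusters);
            if (granted == 0) break;   // nothing left; remaining workloads stay queued
            freeClusters -= granted;
            grants.push_back({w.id, granted});
        }
        return grants;
    }

    int main() {
        auto grants = assignClusters({{"high", 3, 2}, {"mid", 2, 2}, {"low", 1, 1}}, 4);
        for (const auto& g : grants)
            std::printf("%s -> %d cluster(s)\n", g.first.c_str(), g.second);
        return 0;
    }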
A workload may sometimes require processing that exceeds the currently available computing resources in the coprocessor. Therefore, in some examples, the assignment policy (including any of the assignment policies previously described) includes a policy governing the assignment of queued workloads that exceed the currently available computing resources, depending on the hardware of the coprocessor and the system parameters. In one example, a workload that exceeds the currently available computing resources simply remains queued until sufficient computing resources become available to meet the processing requirements of the workload, thereby leaving the limited number of available computing resources idle until a subsequent time period. In another example, the available computing resources are assigned to the highest priority workload currently executed on the coprocessor; that is, the highest priority workload currently executed receives more processing resources (e.g., clusters, partitions, or compute units) than originally requested. In another example, the heavy workload begins execution even if insufficient computing resources are currently available. In yet another example, the highest priority workload whose processing demands fit within the available computing resources is assigned those resources.
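For illustration, the last of the alternatives above (assigning the available resources to the highest priority workload that fits) might be sketched in C++ as follows; the structure and function names are hypothetical, and the other alternatives (waiting, boosting a running workload, or starting the heavy workload anyway) are not shown.

    #include <cstdio>
    #include <vector>

    struct Queued { int priority; int clustersNeeded; };

    // Pick the highest-priority queued workload whose demand fits the free clusters.
    int pickFittingWorkload(const std::vector<Queued>& queue, int freeClusters) {
        int best = -1;
        for (int i = 0; i < static_cast<int>(queue.size()); ++i) {
            if (queue[i].clustersNeeded <= freeClusters &&
                (best < 0 || queue[i].priority > queue[best].priority))
                best = i;
        }
        return best;   // index of chosen workload, or -1 if none fits
    }

    int main() {
        std::vector<Queued> queue = {{5, 4}, {3, 2}, {1, 1}};   // priority, clusters needed
        int idx = pickFittingWorkload(queue, 2);                // only 2 clusters free
        if (idx >= 0)
            std::printf("run queued workload %d (priority %d)\n", idx, queue[idx].priority);
        return 0;
    }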
The coprocessor assignment policy may include a combination of the policies described herein. For example, the coprocessor assignment policy may include a mixed exclusive-shared policy, where one or more clusters are exclusively assigned workloads (that is, one cluster receives workloads associated with one queue and another cluster receives workloads associated with another queue), while another cluster implements a shared policy that includes workloads associated with different tasks.
Workloads 1015 and optionally workloads 1016 are sent to a command streamer 1010 for assignment to clusters 1006 or cluster 1008. For example, if queue 1014 includes only workloads 1015, then workloads 1015 are assigned to at least one of clusters 1006 comprising a plurality of compute units 1007. However, when queue 1014 contains workloads 1016, command streamer 1010 is configured to assign workloads 1016 to at least one compute unit 1009 of cluster 1008. In other examples, the assignment of workloads is governed by software methods. As shown in
In some examples, the GPU 1000 includes a plurality of queues and command streamers that assign workloads to distinct computing resources on the GPU. For example,
In another example,
Method 1100 includes receiving one or more workload launch requests from one or more tasks executing on a processor as shown in block 1102. Method 1100 then proceeds to block 1104 by generating at least one launch request including one or more workloads based on a coprocessor assignment policy. Method 1100 then proceeds to block 1105 by assigning workloads identified in the launch requests to processing resources on the coprocessor based on the coprocessor assignment policy. For example, method 1100 optionally proceeds to block 1106 to assign each workload of a launch request to a dedicated cluster of compute units according to an exclusive policy.
Additionally, or alternatively, method 1100 proceeds to block 1108 and assigns each workload of a launch request across a plurality of distinct clusters according to an interleaved policy. In one example of this policy, a first workload in the launch request (e.g., the workload with the highest priority) is assigned first to all the clusters during a first timing interval, followed by a second workload assigned to all the clusters during a second timing interval, and so on so that each workload is sequentially assigned to each of the clusters.
Additionally, or alternatively, method 1100 proceeds to block 1110 and assigns each workload of a launch request to at least one cluster based on the computing parameters and/or the priority of the workload according to a policy-distributed policy. For example, each workload is individually assigned to at least one cluster for a duration of execution on the clusters. A workload associated with a processor task that is higher in priority than another workload will generally be assigned before the lower priority workload and will generally be assigned to more available clusters than the low-priority workload. The number of clusters or partitions that the workload is assigned to depends on the amount of resources necessary to process the workload and/or the amount of computing resources currently available in the coprocessor.
In some examples, the coprocessor assignment policy includes a policy governing the assignment of queued workloads that exceed the currently available computing resources, depending on the hardware of the coprocessor and the system parameters. In one example, a workload that exceeds the currently available computing resources simply remains queued until sufficient computing resources become available to meet the processing requirements of the workload, thereby leaving the limited number of available computing resources idle until a subsequent time period. In another example, the available computing resources are assigned to the highest priority workload currently executed on the coprocessor; that is, the highest priority workload currently executed receives more processing resources (e.g., clusters, partitions, or compute units) than originally requested. In another example, the heavy workload begins execution even if insufficient computing resources are currently available. And in yet another example, the highest priority workload whose processing demands fit within the available computing resources is assigned those resources.
Additionally, or alternatively, method 1100 proceeds to block 1112 and assigns multiple workloads of the launch request between multiple clusters during the same timing interval so that portions of the workload are shared between multiple clusters according to a shared assignment policy. In one example, each workload in the launch request is shared across all clusters during the same timing interval so that each cluster is processing each workload simultaneously. Other coprocessor assignment policies are possible.
Method 1200 includes block 1202 and receives information on workload budget constraints, for example, from workload launch requests received by a driver. When a currently executed workload runs out of budget, method 1200 proceeds to block 1203 and determines whether there is any additional processing budget remaining after processing workloads on the coprocessor. If there is additional processing budget remaining, method 1200 proceeds to block 1204 and acquires the corresponding task budget from the completed workloads and additionally receives the priority of the completed workloads. From there, the additional budget and priority information can be used to process queued workloads during a subsequent timing interval.
If no budget is available, then method 1200 proceeds to block 1206 to preempt and/or stop the currently executed workload. Optionally, method 1200 can then proceed to block 1208 and reschedule workloads for execution to the coprocessor. This example can be implemented when scheduling workloads according to the coprocessor scheduling policy as previously described. Additionally, or alternatively, method 1200 proceeds to block 1210 (either directly from block 1208 or from block 1206) to reassign the workload priority and optionally reschedule workloads for execution based on the updated workload priority.
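A compact C++ sketch of the budget handling described in blocks 1202 through 1210 is given below, assuming each workload carries a simple numeric budget; the structures and rules are illustrative simplifications rather than the claimed method.

    #include <cstdio>

    struct Workload {
        const char* id;
        int priority;
        double budgetRemaining;   // processing budget left in the current interval
    };

    // Budget handling sketch: when a workload completes with budget to spare,
    // that budget (and the workload's priority) can be applied to queued work
    // (block 1204); when a running workload exhausts its budget, it is preempted
    // or stopped (block 1206) and may be rescheduled with an updated priority
    // (blocks 1208-1210).
    void onBudgetEvent(Workload& w, bool completed, double& reclaimedBudget) {
        if (completed && w.budgetRemaining > 0.0) {
            reclaimedBudget += w.budgetRemaining;
            std::printf("%s finished; reclaimed %.1f budget\n", w.id, w.budgetRemaining);
        } else if (!completed && w.budgetRemaining <= 0.0) {
            std::printf("%s out of budget; preempting\n", w.id);
            w.priority -= 1;   // demote so it is rescheduled later
        }
    }

    int main() {
        double reclaimed = 0.0;
        Workload a{"WL-A", 3, 2.0};   // completes early with leftover budget
        Workload b{"WL-B", 2, 0.0};   // still running but out of budget
        onBudgetEvent(a, true, reclaimed);
        onBudgetEvent(b, false, reclaimed);
        std::printf("budget available for queued workloads: %.1f\n", reclaimed);
        return 0;
    }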
Method 1300 includes block 1302 and sorts workloads from one or more workload launch requests into one or more contexts. In some examples, a coprocessor includes multiple queues, each independently configured with distinct workloads isolated from the workloads associated with other queues. In these examples, workloads are sorted into each of the multiple contexts corresponding to those queues. Alternatively, for coprocessors that have only one context, all workloads are sorted into the single context.
Method 1300 then proceeds to block 1304 by sorting the workloads within a given context based on the priority of each workload in the context. This step is repeated or conducted in parallel for each context that is supported by the coprocessor. In some examples, the number of contexts depends on the number of queues on the coprocessor. For example, a coprocessor may have two queues that each correspond to one of the contexts. For coprocessors that implement multiple contexts, method 1300 optionally proceeds to block 1306 to sort the contexts based on the priority of the queues associated with each context. The queue with the highest priority will be scheduled and assigned first in the list of queued contexts. For single-queue coprocessors, block 1306 is not required because the coprocessor computing resources will receive the single queue that contains the list of workloads scheduled for execution. Once the context is selected based on the priority of the queues, the computing resources begin executing the workloads of the respective queues in the selected context in order of workload priority. The priority ordering for each queue and/or context is determined at a point in time and may be further updated or adjusted as additional workload requests become available.
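Purely for illustration, the sorting in blocks 1302 through 1306 might resemble the nested ordering below; the C++ data layout is hypothetical and mirrors the earlier sketch.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Workload { int priority; };
    struct ContextQueue {
        int queuePriority;                 // priority of the queue tied to this context
        std::vector<Workload> workloads;   // workloads sorted within the context
    };

    // Sort workloads inside each context by workload priority (block 1304),
    // then order the contexts themselves by the priority of their queues (block 1306).
    void sortForDispatch(std::vector<ContextQueue>& contexts) {
        for (ContextQueue& ctx : contexts)
            std::sort(ctx.workloads.begin(), ctx.workloads.end(),
                      [](const Workload& a, const Workload& b) { return a.priority > b.priority; });
        std::sort(contexts.begin(), contexts.end(),
                  [](const ContextQueue& a, const ContextQueue& b) {
                      return a.queuePriority > b.queuePriority;
                  });
    }

    int main() {
        std::vector<ContextQueue> contexts = {
            {1, {{2}, {7}, {4}}},
            {9, {{5}, {3}}}
        };
        sortForDispatch(contexts);   // the context whose queue has priority 9 is dispatched first
        std::printf("dispatch first: queue priority %d\n", contexts.front().queuePriority);
        return 0;
    }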
Beginning at block 1402, method 1400A selects the highest priority context from a plurality of contexts that each include a plurality of workloads to be scheduled and assigned to processing resources of a coprocessor. Method 1400A then proceeds to block 1403 and determines, for a workload in the given context, whether there are higher priority workloads that remain in the context. If no higher priority workload exists in the context, method 1400A optionally proceeds to block 1404 to allocate one or more clusters to execute work as defined by the coprocessor assignment policy, examples of which are described above. Additionally or alternatively, method 1400A terminates at block 1408.
If higher priority workloads still exist in the context, method 1400A instead proceeds to block 1406 and prepares the highest priority workload in the context for execution. Method 1400A can then proceed to indicator block A (block 1410) to continue into method 1400B.
From indicator block A (block 1410), method 1400B proceeds to block 1411 and determines whether there is sufficient space on the coprocessor to execute the higher priority workload prepared in block 1406. If so, then method 1400B proceeds to block 1418 and launches the higher priority workload on the coprocessor. In examples where a context supports multiple queues, the workloads may be distributed among the queues before being executed on the coprocessor.
If insufficient space is available on the coprocessor, then method 1400B instead optionally proceeds to block 1412 by determining whether there are any workloads currently executed or scheduled that are utilizing extra clusters on the coprocessor, which can be determined based on scheduling parameters associated with the tasks, partition, or timing window, including budget, priority rules, requested number of clusters, among other parameters. If none of the workloads executing or scheduled are utilizing extra clusters, then method 1400B proceeds to block 1416 and preempts a lower priority workload(s) based on the preemption policy until there are sufficient clusters for the higher priority workload to execute. From there, method 1400B can proceed back to block 1411 to determine whether there is sufficient space on the GPU. Otherwise, if there are such workloads utilizing extra clusters on the coprocessor, method 1400B instead optionally proceeds to block 1413 by preempting workloads using extra clusters. Method 1400B then optionally determines at block 1414 again whether there is sufficient space on the coprocessor to launch the higher priority workload after preempting the extra clusters. If not, method 1400B proceeds to block 1416 and preempts the lowest priority workload(s) based on the preemption policy until there are sufficient clusters for the higher priority workload to execute. However, if sufficient space is available at block 1414, then method 1400B proceeds to block 1418 and launches the higher priority workload on the coprocessor. Method 1400B can then proceed to indicator block B (block 1420) and continue further into method 1400C.
Beginning from indicator block B (block 1420), method 1400C proceeds to block 1421 and determines whether there are any idle or available clusters on the coprocessor. If there are no idle clusters (all clusters are currently processing workloads), method 1400C ends at block 1428. If there are idle clusters on the coprocessor, method 1400C optionally proceeds to block 1422 to determine whether there is sufficient space available for the next highest priority workload. If there is sufficient space available on the idle clusters to process the next workload, method 1400C proceeds to block 1426 and prepares the next highest priority workload for execution on at least one of the idle clusters. However, if there are idle clusters but not enough space to execute the next highest priority workload at block 1422, method 1400C optionally proceeds to block 1424 to allocate the idle clusters based on the coprocessor assignment policy. For example, rather than executing the next highest priority workload, the idle clusters can be allocated to currently executing workloads on other clusters under a policy-distributed assignment policy or any of the other coprocessor assignment policies described herein. Method 1400C can then end at block 1428.
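The following is a minimal sketch, in Python, of method 1400C (blocks 1421-1428): idle clusters are given to the next queued workload if it fits, and are otherwise redistributed under the coprocessor assignment policy. All names and interfaces are assumptions introduced for illustration.

```python
# Illustrative sketch of method 1400C. All object interfaces are assumptions.
def fill_idle_clusters(coprocessor, queued_workloads, assignment_policy):
    idle = coprocessor.idle_clusters()
    if not idle:
        return                                       # block 1428: nothing to do
    if queued_workloads:
        nxt = queued_workloads[0]                    # next highest priority workload
        if len(idle) >= nxt.clusters_needed:         # block 1422: does it fit?
            coprocessor.prepare(nxt, idle[:nxt.clusters_needed])  # block 1426
            return
    # Block 1424: give the idle clusters to already-running workloads,
    # e.g., under a policy-distributed assignment policy.
    assignment_policy.redistribute(idle, coprocessor.running())
```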
The methods and techniques described herein may be implemented in digital electronic circuitry, or with a programmable processor (for example, a special-purpose processor or a general-purpose processor such as a computer), firmware, software, or various combinations of these. Apparatus embodying these techniques may include appropriate input and output devices, a programmable processor, and a storage medium tangibly embodying program instructions for execution by the programmable processor. A process embodying these techniques may be performed by a programmable processor executing a program of instructions to perform desired functions by operating on input data and generating appropriate output. The techniques may advantageously be implemented in one or more programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Generally, a processor will receive instructions and data from a read-only memory and/or a random-access memory. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and digital video disks (DVDs). Any of the foregoing may be supplemented by, or incorporated in, specially-designed application-specific integrated circuits (ASICs).
Example 1 includes a processing system comprising: a processor; a coprocessor configured to implement a processing engine; a processing engine scheduler configured to schedule workloads for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor, and in response generate at least one launch request for submission to the coprocessor based on a coprocessor scheduling policy; wherein based on the coprocessor scheduling policy, the processing engine scheduler selects which coprocessor clusters are activated to execute workloads identified by a queue based on the at least one launch request; and wherein the coprocessor scheduling policy defines at least one of: tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to immediately execute on the coprocessor within time of a timing window in which the one or more tasks are being executed on the processor; or tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute on the coprocessor based on an order of priority and either: with respect to an external event common to both the processor and coprocessor, or during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor.
Example 2 includes the processing system of Example 1, wherein the coprocessor scheduling policy defines a loosely-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute independently of a timing window of the one or more tasks executing on the processor and based on an order of priority of the workloads.
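The following is a minimal sketch, in Python, contrasting the tightly-coupled and loosely-coupled scheduling described in Examples 1-2. The enum values and scheduler hooks (run_now, run_by_priority, task_window) are hypothetical stand-ins, not an implementation of the claimed system.

```python
# Illustrative sketch of the coupling modes in Examples 1-2. All names are assumptions.
from enum import Enum, auto

class Coupling(Enum):
    TIGHT_IMMEDIATE = auto()   # run within the requesting task's own timing window
    TIGHT_DEFERRED = auto()    # run by priority in a later window or on a common event
    LOOSE = auto()             # run by priority, independent of processor timing windows

def enqueue(scheduler, launch_request, policy):
    if policy is Coupling.TIGHT_IMMEDIATE:
        scheduler.run_now(launch_request, window=launch_request.task_window)
    elif policy is Coupling.TIGHT_DEFERRED:
        scheduler.run_by_priority(launch_request,
                                  not_before=launch_request.task_window.end)
    else:  # Coupling.LOOSE
        scheduler.run_by_priority(launch_request)
```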
Example 3 includes the processing system of any of Examples 1-2, wherein the processor includes a central processing unit (CPU) including at least one processing core, and the coprocessor includes a graphics processing unit (GPU), a processing accelerator, a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).
Example 4 includes the processing system of any of Examples 1-3, wherein the processor comprises a plurality of processor cores, and wherein the processing engine scheduler is configured to generate at least one launch request that schedules workloads associated with one processor core of the plurality of processor cores to multiple clusters of the coprocessor for execution.
Example 5 includes the processing system of any of Examples 1-4, wherein the workloads identified by the at least one launch request are scheduled to execute during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor, and/or during a subsequent data frame boundary of the processor.
Example 6 includes the processing system of any of Examples 1-5, wherein the coprocessor scheduling policy includes a preemption policy that defines a coupled coprocessor scheduling where one or more workloads scheduled for execution or currently being executed on the coprocessor are configured to be preempted by one or more workloads queued to be executed based on the order of priority.
Example 7 includes the processing system of Example 6, wherein the one or more workloads currently being executed on the coprocessor are configured to be preempted by one or more higher priority workloads queued to be executed, and wherein the coprocessor is configured to: store the one or more workloads currently being executed on the coprocessor; and reschedule the stored one or more workloads for execution during a subsequent timing window that is after the higher priority workloads have been executed.
Example 8 includes the processing system of any of Examples 6-7, wherein the preemption policy defines at least one of: a coupled coprocessor scheduling where one or more workloads currently executed on the coprocessor are configured to be completed and a subsequent workload queued for execution is preempted by a higher priority workload; a coupled coprocessor scheduling where one or more workloads currently being executed on the coprocessor are configured to be preempted by a higher priority workload; a coupled coprocessor scheduling where one or more workloads currently being executed on the coprocessor are configured to be preempted by a higher priority workload, wherein the one or more workloads include an indicator that identifies a portion of a respective workload that has been already executed, and wherein the one or more workloads are configured to be stored and re-executed starting at the indicator; or a coupled coprocessor scheduling where the one or more workloads scheduled for execution are partitioned into a plurality of sub-portions and each of the plurality of sub-portions are configured to be preempted by a higher priority workload.
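The following is a minimal sketch, in Python, of the four preemption variants listed in Example 8. The mode names and the progress_indicator checkpoint field are assumptions made for illustration only.

```python
# Illustrative sketch of the preemption variants in Example 8. Names are assumptions.
from enum import Enum, auto

class PreemptionMode(Enum):
    COMPLETE_THEN_PREEMPT_NEXT = auto()  # finish running work; preempt the queued successor
    PREEMPT_RUNNING = auto()             # stop running work; restart it later from the beginning
    PREEMPT_AT_INDICATOR = auto()        # stop running work; resume later from a saved indicator
    PREEMPT_SUB_PORTIONS = auto()        # work split into sub-portions, preemptible between them

def resume_point(workload, mode):
    """Where a preempted workload restarts when it is rescheduled."""
    if mode is PreemptionMode.PREEMPT_AT_INDICATOR:
        return workload.progress_indicator   # the already-executed portion is skipped
    return 0                                 # otherwise re-execute from the start
```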
Example 9 includes the processing system of any of Examples 1-8, wherein the processing engine includes a computing engine, a rendering engine, or an artificial intelligence (AI) inference engine, and wherein the processing engine scheduler includes a computing engine scheduler, a rendering engine scheduler, or an inference engine scheduler.
Example 10 includes a coprocessor configured to be coupled to a processor and configured to implement a processing engine, the coprocessor comprising: at least one cluster configured to execute workloads; a processing engine scheduler configured to schedule workloads for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor, and in response generate at least one launch request for submission based on a coprocessor scheduling policy; wherein based on the coprocessor scheduling policy, the processing engine scheduler selects which of the at least one cluster is activated to execute workloads identified by a queue comprising the at least one launch request; and wherein the coprocessor scheduling policy defines at least one of: tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to immediately execute on the coprocessor within time of a timing window in which the one or more tasks are being executed on the processor; or tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute on the coprocessor based on an order of priority and either: with respect to an external event common to both the processor and coprocessor, or during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor.
Example 11 includes the coprocessor of Example 10, wherein the processing engine includes a computing engine, a rendering engine, or an artificial intelligence (AI) inference engine, and wherein the processing engine scheduler includes a computing engine scheduler, a rendering engine scheduler, or an inference engine scheduler.
Example 12 includes the coprocessor of any of Examples 10-11, wherein the coprocessor scheduling policy defines a loosely-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute independently of a timing window of the one or more tasks executing on the processor and based on an order of priority of the workloads.
Example 13 includes the coprocessor of any of Examples 10-12, wherein the processing engine scheduler is configured to generate at least one launch request that schedules workloads associated with one processing core of a plurality of processing cores of the processor to multiple clusters of the coprocessor for execution.
Example 14 includes the coprocessor of any of Examples 10-13, wherein the workloads identified by the at least one launch request are scheduled to execute during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor, and/or during a subsequent data frame boundary of the processor.
Example 15 includes the coprocessor of any of Examples 10-14, wherein the coprocessor scheduling policy includes a preemption policy that defines a coupled coprocessor scheduling where one or more workloads scheduled for execution or currently executed on the coprocessor are configured to be preempted by one or more workloads queued to be executed based on the order of priority.
Example 16 includes the coprocessor of Example 15, wherein the one or more workloads currently executed on the coprocessor are configured to be preempted by one or more higher priority workloads queued to be executed, and wherein the coprocessor is configured to: store one or more workloads currently executed on the coprocessor; and reschedule the stored one or more workloads for execution during a subsequent timing window that is after the higher priority workloads have been executed.
Example 17 includes the coprocessor of any of Examples 15-16, wherein the preemption policy defines at least one of: a coupled coprocessor scheduling where one or more workloads currently executed on the coprocessor are configured to be completed and a subsequent workload queued for execution is preempted by a higher priority workload; a coupled coprocessor scheduling where one or more workloads currently executed on the coprocessor are configured to be preempted by a higher priority workload; a coupled coprocessor scheduling where one or more workloads currently executed on the coprocessor are configured to be preempted by a higher priority workload, wherein the one or more workloads include an indicator that identifies a portion of a respective workload that has been already executed, and wherein the one or more workloads are configured to be stored and re-executed starting at the indicator; or a coupled coprocessor scheduling where the one or more workloads scheduled for execution are partitioned into a plurality of sub-portions and each of the plurality of sub-portions are configured to be preempted by a higher priority workload.
Example 18 includes a method, comprising: receiving one or more workload launch requests from one or more tasks executing on a processor, wherein the one or more workload launch requests include one or more workloads configured for execution on a coprocessor; generating at least one launch request in response to the one or more workload launch requests based on a coprocessor scheduling policy; scheduling one or more workloads identified in the at least one launch request for execution on the coprocessor based on the coprocessor scheduling policy by at least one of: a tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to immediately execute on the coprocessor within time of a timing window in which the one or more tasks are being executed on the processor; or a tightly-coupled coprocessor scheduling where workloads identified by the at least one launch request are scheduled to execute on the coprocessor based on an order of priority and either: with respect to an external event common to both the processor and coprocessor, or during a subsequent timing window after the time of the timing window in which the one or more tasks are being executed on the processor.
Example 19 includes the method of Example 18, comprising preempting at least one workload scheduled for execution or currently executed on the coprocessor by one or more workloads queued to be executed based on the order of priority.
Example 20 includes the method of Example 19, wherein preempting at least one workload scheduled for execution or currently executed on the coprocessor comprises: a coupled coprocessor scheduling where one or more workloads currently executed on the coprocessor are configured to be completed and a subsequent workload queued for execution is preempted by a higher priority workload; a coupled coprocessor scheduling where one or more workloads currently executed on the coprocessor are configured to be preempted by a higher priority workload; a coupled coprocessor scheduling where one or more workloads currently executed on the coprocessor are configured to be preempted by a higher priority workload, wherein the one or more workloads include an indicator that identifies a portion of a respective workload that has been already executed, and wherein the one or more workloads are configured to be stored and re-executed starting at the indicator; or a coupled coprocessor scheduling where the one or more workloads scheduled for execution are partitioned into a plurality of sub-portions and each of the plurality of sub-portions are configured to be preempted by a higher priority workload.
Example 21 includes a processing system comprising: a processor; a coprocessor configured to implement a processing engine; a processing engine scheduler configured to schedule workloads for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing or executed on the processor, and in response generate at least one launch request for submission to the coprocessor; wherein the coprocessor comprises a plurality of compute units and at least one command streamer associated with one or more of the plurality of compute units; wherein, based on a coprocessor assignment policy, the processing engine scheduler is configured to assign for a given execution partition, via the at least one command streamer, clusters of compute units of the coprocessor to execute one or more workloads identified by the one or more workload launch requests as a function of workload priority; wherein the coprocessor assignment policy defines at least: an exclusive assignment policy wherein each workload is executed by a dedicated cluster of compute units; an interleaved assignment policy wherein each workload is exclusively executed across all compute units of the clusters of compute units; a policy-distributed assignment policy wherein each workload is individually assigned to at least one cluster of the clusters of compute units and an execution duration during the given execution partition; or a shared assignment policy wherein each workload is non-exclusively executed by the clusters of compute units each concurrently executing multiple workloads.
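The following is a minimal sketch, in Python, of the four coprocessor assignment policies named in Example 21. The cluster-assignment logic is a simplified, hypothetical stand-in; the requested_clusters attribute is an assumption for illustration.

```python
# Illustrative sketch of the assignment policies in Example 21. Names are assumptions.
from enum import Enum, auto

class AssignmentPolicy(Enum):
    EXCLUSIVE = auto()           # each workload gets a dedicated cluster
    INTERLEAVED = auto()         # each workload runs alone across all compute units, in turn
    POLICY_DISTRIBUTED = auto()  # each workload gets specific clusters and an execution duration
    SHARED = auto()              # clusters run several workloads concurrently

def assign(policy, workloads, clusters):
    """Return a hypothetical mapping of workload -> clusters it may use."""
    if policy is AssignmentPolicy.EXCLUSIVE:
        return {w: [c] for w, c in zip(workloads, clusters)}
    if policy is AssignmentPolicy.INTERLEAVED:
        return {w: list(clusters) for w in workloads}        # whole device, one workload at a time
    if policy is AssignmentPolicy.POLICY_DISTRIBUTED:
        return {w: w.requested_clusters for w in workloads}  # plus a per-workload duration
    return {w: list(clusters) for w in workloads}            # SHARED: clusters host many workloads
```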
Example 22 includes the processing system of Example 21, wherein the processor includes a central processing unit (CPU) including at least one processing core, and the coprocessor includes a graphics processing unit (GPU), a processing accelerator, a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).
Example 23 includes the processing system of any of Examples 21-22, wherein the at least one command streamer is configured to receive workloads from a shared context comprising workloads associated with a plurality of tasks executing or executed on the processor, and wherein the at least one command streamer is configured to assign a workload associated with a first task of the plurality of tasks to a first set of clusters and to assign a workload associated with a second task of the plurality of tasks to a second set of clusters distinct from the first set of clusters.
Example 24 includes the processing system of any of Examples 21-23, wherein the at least one command streamer comprises a plurality of command streamers configured to receive workloads from a shared context, wherein the shared context comprises a plurality of queues, wherein a first command streamer of the plurality of command streamers is configured to: receive first workloads associated with a first task executing or executed on the processor from a first queue of the plurality of queues, and assign the first workloads to a first set of clusters of compute units; wherein a second command streamer of the plurality of command streamers is configured to: receive second workloads associated with a second task distinct from the first task executing or executed on the processor from a second queue of the plurality of queues distinct from the first queue; and assign the second workloads to a second set of clusters of compute units distinct from the first set of clusters of compute units.
Example 25 includes the processing system of any of Examples 21-24, wherein the at least one command streamer comprises a plurality of command streamers, wherein a first command streamer of the plurality of command streamers is configured to: receive first workloads associated with a first task executing or executed on the processor from a first queue of a first context, and assign the first workloads to a first set of clusters of compute units; wherein a second command streamer of the plurality of command streamers is configured to: receive second workloads associated with a second task distinct from the first task executing or executed on the processor from a second queue of a second context distinct from the first context; and assign the second workloads to a second set of clusters of compute units distinct from the first set of clusters of compute units.
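The following is a minimal sketch, in Python, of the routing described in Examples 23-25: command streamers pull workloads from per-task queues and dispatch them to disjoint sets of clusters. The Cluster and CommandStreamer classes and their methods are illustrative assumptions only.

```python
# Illustrative sketch of per-task command streamers with disjoint cluster sets.
class Cluster:
    def __init__(self, cid):
        self.cid = cid
    def assign(self, workload):
        print(f"cluster {self.cid} <- {workload}")

class CommandStreamer:
    """Pulls workloads from one queue and routes them to its own cluster set."""
    def __init__(self, queue, clusters):
        self.queue, self.clusters = queue, clusters
    def dispatch_next(self):
        workload = self.queue.pop(0)
        for cluster in self.clusters:
            cluster.assign(workload)   # workloads from other tasks never land here

clusters = [Cluster(i) for i in range(4)]
streamer_a = CommandStreamer(["task_A_kernel"], clusters[:2])  # first task -> first cluster set
streamer_b = CommandStreamer(["task_B_kernel"], clusters[2:])  # second task -> disjoint cluster set
streamer_a.dispatch_next()
streamer_b.dispatch_next()
```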
Example 26 includes the processing system of any of Examples 21-25, wherein the coprocessor assignment policy includes a preemption policy that defines that one or more workloads assigned to one or more clusters of the clusters of compute units or being assigned to the one or more clusters of the clusters of compute units are configured to be preempted by one or more workloads queued to be assigned for execution on the coprocessor based on the workload priority.
Example 27 includes the processing system of any of Examples 21-26, wherein: the processing engine includes a computing engine and the processing engine scheduler includes a computing engine scheduler; the processing engine includes a rendering engine and the processing engine scheduler includes a rendering engine scheduler; or the processing engine includes an artificial intelligence (AI) inference engine and the processing engine scheduler includes an inference engine scheduler.
Example 28 includes the processing system of any of Examples 21-27, wherein the processing engine scheduler is configured to assign one or more clusters of the clusters of compute units to execute the workloads based on an amount of processing required to complete the workloads.
Example 29 includes the processing system of Example 28, wherein the processing engine scheduler is configured to assign one or more additional clusters of the clusters of compute units to execute the workloads to compensate for when the amount of processing required to complete the workloads exceeds currently available processing resources on the coprocessor.
Example 30 includes a coprocessor configured to be coupled to a processor and configured to implement a processing engine, the coprocessor comprising: a plurality of compute units each configured to execute workloads; a processing engine scheduler configured to assign workloads for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing or executed on the processor, and in response to generate at least one launch request for submission to the coprocessor; wherein the coprocessor comprises at least one command streamer associated with one or more of the plurality of compute units; wherein, based on a coprocessor assignment policy, the processing engine scheduler is configured to assign for a given execution partition, via the at least one command streamer, clusters of compute units of the coprocessor to execute one or more workloads identified by the one or more workload launch requests as a function of workload priority; wherein the coprocessor assignment policy defines at least: an exclusive policy wherein each workload is executed by a dedicated cluster of the clusters of compute units; an interleaved policy wherein each workload is exclusively executed across all compute units of at least one cluster of the clusters of compute units; a policy-distributed policy wherein each workload is individually assigned to at least one cluster of the clusters of compute units and an execution duration during the given execution partition; a shared policy wherein each workload is non-exclusively executed by the clusters of compute units each concurrently executing multiple workloads.
Example 31 includes the coprocessor of Example 30, wherein the coprocessor assignment policy defines two or more of the exclusive policy, interleaved policy, policy-distributed policy, or shared policy, and wherein the processing engine scheduler is configured to adjust the coprocessor assignment policy from one policy to a second policy at a subsequent timing boundary associated with the processor.
Example 32 includes the coprocessor of any of Examples 30-31, wherein the processing engine scheduler is configured to determine unused processing resources allocated to a completed workload, and to assign a subsequently queued workload for execution on the coprocessor based on the unused processing resources and processing resources allocated to the subsequently queued workload.
Example 33 includes the coprocessor of any of Examples 30-32, wherein the exclusive policy defines at least one of an exclusive-access policy and an exclusive-slice policy, wherein the exclusive-access policy defines an assignment policy wherein each workload is assigned to all clusters of the clusters of compute units, wherein the exclusive-slice policy defines an assignment policy: wherein workloads associated with a first task executing or executed on the processor are assigned to a first plurality of clusters and wherein workloads associated with a second task executing or executed on the processor are assigned to a second plurality of clusters; and/or wherein workloads associated with a first task executing or executed on the processor are assigned to first portions of a cluster, and wherein workloads associated with a second task executing or executed on the processor are assigned to second portions of the cluster.
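The following is a minimal sketch, in Python, of the distinction drawn in Example 33 between exclusive-access and exclusive-slice assignment. The slicing arithmetic and data shapes are assumptions for illustration, not the claimed partitioning scheme.

```python
# Illustrative sketch of Example 33. Inputs and partitioning details are assumptions.
def exclusive_access(workloads, clusters):
    """Exclusive-access: each workload, in turn, is given every cluster."""
    return {w: list(clusters) for w in workloads}

def exclusive_slice(task_workloads, clusters):
    """Exclusive-slice: each task's workloads are confined to their own slice of clusters."""
    slices = {}
    per_task = max(len(clusters) // max(len(task_workloads), 1), 1)
    for i, task in enumerate(task_workloads):
        slices[task] = clusters[i * per_task:(i + 1) * per_task]
    return slices
```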
Example 34 includes the coprocessor of any of Examples 30-33, wherein the coprocessor assignment policy includes a preemption policy that defines that one or more workloads assigned to one or more clusters of the clusters of compute units or being assigned to the one or more clusters of the clusters of compute units are configured to be preempted by one or more workloads queued to be assigned for execution based on the workload priority.
Example 35 includes the coprocessor of any of Examples 30-34, wherein: the processing engine includes a computing engine and the processing engine scheduler includes a computing engine scheduler; the processing engine includes a rendering engine and the processing engine scheduler includes a rendering engine scheduler; or the processing engine includes an artificial intelligence (AI) inference engine and the processing engine scheduler includes an inference engine scheduler.
Example 36 includes the coprocessor of any of Examples 30-35, wherein the processing engine scheduler is configured to assign one or more clusters of the clusters of compute units to execute the workloads based on an amount of processing required to complete the workloads.
Example 37 includes the coprocessor of Example 36, wherein the processing engine scheduler is configured to assign one or more additional clusters of the clusters of compute units to execute the workloads to compensate for when the amount of processing required to complete the workloads exceeds currently available processing resources on the coprocessor.
Example 38 includes a method, comprising: receiving one or more workload launch requests from one or more tasks executing or executed on a processor, wherein the one or more workload launch requests include one or more workloads configured for execution on a coprocessor; generating at least one launch request in response to the one or more workload launch requests; assigning clusters of compute units of the coprocessor to execute one or more workloads identified in the one or more workload launch requests as a function of workload priority based on a coprocessor assignment policy, wherein the coprocessor assignment policy defines at least: an exclusive policy wherein each workload is executed by a dedicated cluster of the clusters of compute units; an interleaved policy wherein each workload is exclusively executed across all compute units of at least one cluster of the clusters of compute units; a policy-distributed policy wherein each workload is individually assigned to at least one cluster of the clusters of compute units and an execution duration during a given execution partition; a shared policy wherein each workload is non-exclusively executed by the clusters of compute units each concurrently executing multiple workloads.
Example 39 includes the method of Example 38, comprising preempting at least one workload scheduled for execution or currently executed on the coprocessor by one or more workloads queued to be executed based on the workload priority.
Example 40 includes the method of any of Examples 38-39, wherein assigning clusters of compute units of the coprocessor to execute one or more workloads identified in the one or more workload launch requests comprises assigning one or more additional clusters of the clusters of compute units to execute the one or more workloads to compensate for when an amount of processing required to complete the one or more workloads exceeds currently available processing resources on the coprocessor.
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. Therefore, it is manifestly intended that this invention be limited only by the claims and the equivalents thereof.
This application is a continuation of, and claims the benefit of, U.S. application Ser. No. 17/705,959, filed Mar. 28, 2022, and titled “PROCESSING ENGINE SCHEDULING FOR TIME-SPACE PARTITIONED PROCESSING SYSTEMS,” the contents of which are hereby incorporated herein by reference.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 17705959 | Mar 2022 | US |
| Child | 17707164 | | US |