Computing systems often include a number of processing resources (e.g., one or more processors), which may retrieve and execute instructions and store the results of the executed instructions to a suitable location. A processing resource (e.g., central processing unit (CPU) or graphics processing unit (GPU)) can comprise a number of functional units such as arithmetic logic unit (ALU) circuitry, floating point unit (FPU) circuitry, and/or a combinatorial logic block, for example, which can be used to execute instructions by performing arithmetic operations on data. For example, functional unit circuitry may be used to perform arithmetic operations such as addition, subtraction, multiplication, and/or division on operands. Typically, the processing resources (e.g., processor and/or associated functional unit circuitry) are external to a memory array, and data is accessed via a bus or interconnect between the processing resources and the memory array to execute a set of instructions. To reduce the number of accesses to fetch or store data in the memory array, computing systems may employ a cache hierarchy that temporarily stores recently accessed or modified data for use by a processing resource or a group of processing resources. However, processing performance may be further improved by offloading certain operations to a memory-based execution device in which processing resources are implemented internal to and/or near a memory, such that data processing is performed closer to the memory location storing the data rather than bringing the data closer to the processing resource. A memory-based execution device may save time by reducing external communications (i.e., processor-to-memory-array communications) and may also conserve power.
Remote execution devices may be used by processors (e.g., central processing units (CPUs) and graphics processing units (GPUs)) to speed up computations that are memory intensive. These remote execution devices may be implemented in or near memory to facilitate the fast transfer of data. One example of a remote execution device is a processing-in-memory (PIM) device. PIM technology is advantageous in the evolution of massively parallel systems like GPUs; however, to be implemented in such systems, PIM architectures should work in multi-process and multi-tenant environments. A PIM architecture allows some of the processor's computations to be offloaded to PIM-enabled memory banks to offset data transfer times and speed up overall execution. To speed up memory intensive applications, PIM-enabled memory banks contain local storage and an arithmetic logic unit (ALU) that allow computation to be performed at the memory level. However, resource virtualization for PIM devices is lacking. This lack of resource virtualization restricts the current execution model of PIM-enabled task streams to a sequential execution model, where a PIM task must execute all its instructions to completion and write all its data to the bank before ceding control of the PIM banks to the next task in the stream. Such an execution model is extremely inefficient and degrades performance for massively parallel systems like GPUs, where multiple tasks must co-execute in order to efficiently utilize the available compute power and memory bandwidth. Additionally, GPUs have provided independent forward progress guarantees for kernels in different queues since the introduction of Open Computing Language (OpenCL) queues and Compute Unified Device Architecture (CUDA) streams; the lack of PIM resource virtualization breaks these guarantees.
Embodiments in accordance with the present disclosure provide resource virtualization for remote execution devices such as PIM-enabled systems. For example, PIM resource virtualization facilitates handling of multiple contexts from different in-flight tasks and may provide significant performance improvements over sequential execution in PIM-enabled systems. Resource virtualization of remote execution devices as described herein can ensure correct execution by maintaining independent forward progress guarantees at the PIM task level. The resource virtualization techniques described herein allow PIM architectures to execute concurrent applications, thereby improving system performance and overall utilization. These techniques are well suited for memory-intensive applications, graphical applications, and machine learning applications.
An embodiment is directed to a method of virtualizing resources of a memory-based execution device. The method includes orchestrating the execution of two or more offload tasks on a remote execution device and initiating a context switch on the remote execution device from a first offload task to a second offload task. In some implementations, orchestrating the execution of two or more offload tasks on the remote execution device includes concurrently scheduling the two or more offload tasks in two or more respective queues and, at the outset of a task execution interval, selecting one offload task from the two or more queues for access to the remote execution device. In some examples, the task execution interval is a fixed amount of time allotted to each of the two or more offload tasks and each of the two or more queues is serviced for a duration of the task execution interval according to a round-robin scheduling policy.
In some implementations, initiating a context switch on the remote execution device from a first offload task to a second offload task includes initiating the storing of context state data in context storage on the remote execution device. In some implementations, the method also includes restoring the context of the second offload task in the remote execution device.
In some implementations, the remote execution device includes a processing-in-memory (PIM) unit coupled to a memory array. In these implementations, the two or more offload tasks are PIM tasks. In various examples, the context storage may be located in a reserved section of the memory array coupled to the PIM unit or in a storage buffer of a memory interface component coupled to the remote execution device.
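For illustration only, the following minimal C++ sketch models the round-robin orchestration described above. The type and function names (OffloadTask, RoundRobinScheduler) are hypothetical and are not prescribed by this disclosure, and the task execution interval is modeled as an instruction budget rather than a fixed amount of wall-clock time.

```cpp
#include <algorithm>
#include <deque>
#include <iostream>
#include <string>
#include <vector>

struct OffloadTask {
    std::string name;
    int remainingInstructions;  // instructions left to issue for this task
};

class RoundRobinScheduler {
public:
    explicit RoundRobinScheduler(int interval) : interval_(interval) {}

    // Each offload task is concurrently scheduled in its own queue.
    void enqueue(const OffloadTask& task) { queues_.push_back({task}); }

    void run() {
        std::size_t q = 0;
        while (anyPending()) {
            std::deque<OffloadTask>& queue = queues_[q];
            if (!queue.empty()) {
                OffloadTask& task = queue.front();
                // The selected task gets exclusive access to the remote
                // execution device for one task execution interval.
                int issued = std::min(interval_, task.remainingInstructions);
                task.remainingInstructions -= issued;
                std::cout << task.name << ": issued " << issued
                          << " instructions\n";
                if (task.remainingInstructions == 0)
                    queue.pop_front();  // task ran to completion
                // Otherwise the interval expired and the task is preempted;
                // its context would be saved to context storage here.
            }
            q = (q + 1) % queues_.size();  // round-robin to the next queue
        }
    }

private:
    bool anyPending() const {
        for (const auto& queue : queues_)
            if (!queue.empty()) return true;
        return false;
    }
    int interval_;
    std::vector<std::deque<OffloadTask>> queues_;  // one queue per task
};

int main() {
    RoundRobinScheduler scheduler(/*interval=*/4);
    scheduler.enqueue({"taskA", 10});
    scheduler.enqueue({"taskB", 6});
    scheduler.run();
}
```

In hardware, the same round-robin policy would be realized by the task scheduler's scheduling logic rather than by host software as in this sketch.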
Another embodiment is directed to a computing device for virtualizing resources of a memory-based execution device. The computing device is configured to orchestrate the execution of two or more offload tasks on a remote execution device and initiate a context switch on the remote execution device from a first offload task to a second offload task. In some implementations, orchestrating the execution of two or more offload tasks on the remote execution device includes concurrently scheduling the two or more offload tasks in two or more respective queues and, at the outset of a task execution interval, selecting one offload task from the two or more queues for exclusive access to the remote execution device.
In some implementations, initiating a context switch on the remote execution device from a first offload task to a second offload task includes initiating the storing of context state data in context storage on the remote execution device. In some implementations, the computing device is further configured to restore the context of the second offload task in the remote execution device.
In some implementations, the remote execution device includes a processing-in-memory (PIM) unit coupled to a memory array, and the two or more offload tasks are two or more PIM tasks. In various examples, the context storage is located in a reserved section of the memory array coupled to the PIM unit or in a storage buffer of a memory interface component coupled to the remote execution device.
Yet another embodiment is directed to a system for virtualizing resources of a memory-based execution device. The system comprises a processing-in-memory (PIM) enabled memory device and a computing device communicatively coupled to the PIM-enabled memory device. The computing device is configured to orchestrate the execution of two or more PIM tasks on the PIM-enabled memory device and initiate a context switch on the PIM-enabled memory device from a first PIM task to a second PIM task. In some implementations, orchestrating the execution of two or more PIM tasks on the PIM-enabled memory device includes concurrently scheduling the two or more PIM tasks in two or more respective queues and, at the outset of a task execution interval, selecting one PIM task from the two or more queues for exclusive access to the PIM-enabled memory device.
In some implementations, initiating a context switch on the PIM-enabled memory device from a first PIM task to a second PIM task includes initiating the storing of context state data in context storage on the PIM-enabled memory device. In some implementations, the computing device is further configured to restore the context of the second PIM task in the PIM-enabled memory device.
In various implementations, the PIM-enabled memory device includes a PIM execution unit coupled to a memory array and the context storage is located in a reserved section of the memory array.
Embodiments in accordance with the present disclosure will be described in further detail beginning with
In some examples, the memory controller 106 is also used by the host 102 for offloading tasks for remote execution. In these examples, an offload task is a set of instructions or commands that direct a device external to the computing device 150 to carry out a sequence of operations. In this way, the workload on the cores 104 is alleviated by offloading the task for execution on the external device. For example, the offload task may be a processing-in-memory (PIM) task that includes a set of instructions or commands that direct a PIM device to carry out a sequence of operations on data stored in a PIM-enabled memory device.
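For illustration, one hypothetical host-side representation of such an offload task is sketched below in C++. The command set, field names, and the use of a VMID tag are assumptions made for the sketch and are not requirements of this disclosure.

```cpp
#include <cstdint>
#include <vector>

// Opcodes the remote device can execute; the command set is an assumption.
enum class PimOpcode : uint8_t { Load, Add, Store };

struct PimCommand {
    PimOpcode op;       // operation to perform in the PIM unit
    uint8_t   dstReg;   // destination PIM register index
    uint8_t   srcReg;   // source PIM register index (for ALU operations)
    uint64_t  address;  // memory array address (for Load/Store)
};

// An offload task: an ordered command sequence tagged with the identity of
// the originating process (e.g., a virtual machine identifier, VMID).
struct OffloadTask {
    uint16_t vmid;
    std::vector<PimCommand> commands;
};

int main() {
    // A task that loads two operands, adds them (r0 += r1), and commits
    // the result back to the memory array.
    OffloadTask task{/*vmid=*/7, {
        {PimOpcode::Load,  0, 0, 0x100},
        {PimOpcode::Load,  1, 0, 0x108},
        {PimOpcode::Add,   0, 1, 0},
        {PimOpcode::Store, 0, 0, 0x110},
    }};
    (void)task;  // the task would be handed to the task scheduler
}
```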
The example system 100 of
In the example of
In some examples, the remote execution device 114 is a PIM-enabled memory device and the processing unit 116 is a PIM unit that is coupled to a memory array 124 corresponding to a bank of memory within the PIM-enabled memory device. In other examples, the remote execution device 114 is a PIM-enabled memory device and the processing unit 116 is a PIM unit that is coupled to multiple memory arrays 124 corresponding to multiple banks of memory within the PIM-enabled memory device. In one example, the remote execution device 114 is a PIM-enabled DRAM bank that includes a PIM unit (i.e., a processing unit 116) coupled to a DRAM memory array. In another example, the remote execution device 114 is a PIM-enabled DRAM die that includes a PIM unit (i.e., a processing unit 116) coupled to multiple DRAM memory arrays (i.e., multiple memory banks) on the die. In yet another example, the remote execution device 114 is a PIM-enabled stacked high bandwidth memory (HBM) device that includes a PIM unit (i.e., a processing unit 116) on a memory interface die coupled to a memory array (i.e., memory bank) in a DRAM core die. Readers of skill in the art will appreciate that various configurations of PIM-enabled memory devices may be employed without departing from the spirit of embodiments of the present disclosure. In alternative examples, the remote execution device 114 includes an accelerator device (e.g., an accelerator die or Field Programmable Gate Array (FPGA) die) as the processing unit 116 that is coupled to a memory device (e.g., a memory die) that includes the memory array 124. In these examples, the accelerator device and memory device may be implemented in the same die stack or in the same semiconductor package. In such examples, the accelerator device is closely coupled to the memory device such that the accelerator can access data in the memory device faster than the computing device 150 can in most cases.
The example system 100 of
Task scheduling logic 112 in the task scheduler 108 performs time-multiplexed scheduling among multiple concurrently scheduled tasks in the task queues 110, where each task is given the full bandwidth of the processing unit 116 and memory array 124 to perform its operations. For example, in the example of
At the expiration of the interval, the currently executing task is preempted and a context switch to the next task is carried out. In some implementations, the task scheduling logic 112 carries out the context switch by directing the processing unit 116 to store its register state in context storage 128. The context storage 128 may be partitioned in the memory array 124 or located in a separate buffer of the remote execution device 114.
Consider an example where the remote execution device 114 is a PIM-enabled memory bank that includes a PIM unit (i.e., processing unit 116) coupled to the memory array 124, and where tasks A, B, C, and D are concurrently scheduled PIM tasks (i.e., sets of PIM instructions to be executed in the PIM unit of the PIM-enabled memory bank). As a trivial example, each task includes instructions for the PIM unit to load some data from the memory array 124 into the registers 120, perform an arithmetic operation on the data in the ALU 118, write the result to the registers 120, and commit the result to the memory array 124. At the outset, task A is allowed to execute for Y μsecs by issuing the instructions of the task to the PIM unit for execution. After the execution interval elapses, task A is preempted to allow task B to execute. For example, the task scheduling logic 112 may send an instruction to the PIM unit to perform a context switch. The state of the registers 120 is saved to the context storage 128 in the remote execution device prior to beginning execution of instructions for task B. Task B is then allowed to execute for Y μsecs before being preempted for task C, and task C is then allowed to execute for Y μsecs before being preempted for task D. When task D is preempted for the execution of task A, task A's register state is restored from context storage 128, and the task scheduling logic 112 resumes issuing instructions for task A.
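A minimal C++ sketch of the save/restore sequence in this example follows, with the context storage modeled as a map from task identifier to saved register file. The register file size and the function names are illustrative assumptions, not a prescribed interface.

```cpp
#include <array>
#include <cstdint>
#include <iostream>
#include <map>

using RegisterFile = std::array<uint64_t, 8>;  // modeled PIM register file

// Context storage on the remote execution device, modeled as one saved
// register file per task.
std::map<char, RegisterFile> contextStorage;

// Preempt: copy the live register state into the task's context buffer.
void saveContext(char task, const RegisterFile& liveRegs) {
    contextStorage[task] = liveRegs;
}

// Resume: reload the saved register state before the task's instructions
// are issued again; a task with no saved context starts from a cleared file.
void restoreContext(char task, RegisterFile& liveRegs) {
    auto it = contextStorage.find(task);
    liveRegs = (it != contextStorage.end()) ? it->second : RegisterFile{};
}

int main() {
    RegisterFile regs{};           // live registers in the PIM unit
    regs[0] = 42;                  // task A's in-flight intermediate value
    saveContext('A', regs);        // interval expires: preempt task A
    restoreContext('B', regs);     // switch in task B with a fresh context
    restoreContext('A', regs);     // later, resume task A
    std::cout << regs[0] << "\n";  // prints 42: task A's state survived
}
```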
For further explanation
In some examples, a memory bank 206 includes a memory array 210 that is a matrix of memory bit cells with word lines (rows) and bit lines (columns) that is coupled to a row buffer 212 that acts as a cache when reading or writing data to/from the memory array 210. For example, the memory array 210 may be an array of DRAM cells. The memory bank 206 also includes an I/O line sense amplifier (IOSA) 214 that amplifies data read from the memory array 210 for output to the I/O bus (or to a PIM unit, as will be described below). The memory bank 206 may also include additional components not shown here, such as a row decoder, column decoder, command decoder, as well as additional sense amplifiers, drivers, signals, and buffers.
In some embodiments, a memory bank 206 includes a PIM unit 226 that performs PIM computations using data stored in the memory array 210. The PIM unit 226 includes a PIM ALU 218 capable of carrying out basic computations within the memory bank 206, and a PIM register file 220 that includes multiple PIM registers for storing the result of a PIM computation as well as for storing data from the memory array and/or host-generated data that are used as operands of the PIM computation. The PIM unit 226 also includes control logic 216 for loading data from the memory array 210 and host-generated data from the I/O bus into the PIM register file 220, as well as for writing result data to the memory array 210. When a PIM computation or sequence of PIM computations is complete, the result(s) in the PIM register file 220 are written back to the memory array 210. By virtue of its physical proximity to the memory array 210, the PIM unit 226 is capable of completing a PIM task faster than if operand data were transmitted to the host for computation and result data were transmitted back to the memory array 210.
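For illustration, the following C++ sketch models the datapath just described, with the memory array stood in by a simple vector. The register file size and the method names are assumptions of the sketch rather than features of any particular PIM device.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// A software model of the PIM unit: control logic moves data between the
// (modeled) memory array and the PIM register file, and the PIM ALU
// computes on registers.
struct PimUnit {
    std::array<uint64_t, 8> registerFile{};  // PIM register file

    // Control logic: load a word from the memory array into a register.
    void load(const std::vector<uint64_t>& array, std::size_t row, int reg) {
        registerFile[reg] = array[row];
    }
    // PIM ALU: a basic computation performed within the memory bank.
    void add(int dst, int a, int b) {
        registerFile[dst] = registerFile[a] + registerFile[b];
    }
    // Control logic: commit a result register back to the memory array.
    void store(std::vector<uint64_t>& array, std::size_t row, int reg) {
        array[row] = registerFile[reg];
    }
};

int main() {
    std::vector<uint64_t> memoryArray = {3, 4, 0};  // stand-in for a bank
    PimUnit pim;
    pim.load(memoryArray, 0, 0);   // r0 = array[0]
    pim.load(memoryArray, 1, 1);   // r1 = array[1]
    pim.add(2, 0, 1);              // r2 = r0 + r1, computed near the data
    pim.store(memoryArray, 2, 2);  // array[2] = r2
    std::cout << memoryArray[2] << "\n";  // prints 7
}
```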
As previously discussed, a PIM task may include multiple individual PIM instructions. The result of the PIM task is written back to the memory array 210; however, intermediate data may be held in the PIM register file 220 without being written to the memory array 210. Thus, to support preemption of a PIM task on a PIM unit 226 by a task scheduler (e.g., task scheduler 108 of
In some embodiments, the memory array 210 includes reserved memory 208 that stores context state data for PIM tasks executing on the PIM unit 226. In some implementations, the reserved memory 208 includes distinct context storage buffers 222-1, 222-2, 222-3 . . . 222-N corresponding to N processes supported by the host processor system (e.g., computing device 150 of
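A minimal sketch of locating a per-process context storage buffer within the reserved memory follows. The base address, buffer size, and fixed-slot layout are illustrative assumptions rather than a prescribed organization of the reserved memory 208.

```cpp
#include <cstdint>
#include <iostream>

constexpr uint64_t kReservedBase   = 0x0FF00000;  // start of reserved memory
constexpr uint64_t kContextBufSize = 64;          // bytes per saved context
constexpr uint64_t kNumProcesses   = 16;          // N supported processes

// Each supported process (or VMID) owns one fixed-size context storage
// buffer at a fixed offset within the reserved region of the memory array.
uint64_t contextBufferAddress(uint64_t vmid) {
    return kReservedBase + (vmid % kNumProcesses) * kContextBufSize;
}

int main() {
    for (uint64_t vmid : {0, 1, 3})
        std::cout << "VMID " << vmid << " -> 0x" << std::hex
                  << contextBufferAddress(vmid) << std::dec << "\n";
}
```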
In alternative examples to the example depicted in
For further explanation
In some examples, like the memory bank 206 in
In some embodiments, the memory interface die 304 includes a context storage area 334 that stores context state data for PIM tasks executing on the PIM unit 226. In some implementations, the context storage area 334 includes distinct context storage buffers 322-1, 322-2, 322-3 . . . 322-N corresponding to N processes supported by the host processor system (e.g., computing device 150 of
In alternative examples to the example depicted in
For further explanation,
In one example, the offload tasks are PIM tasks that are to be remotely executed by a PIM-enabled memory device. A PIM task includes a set of PIM instructions, all dispatched from the same offload task on a computing device, that are to be executed by a PIM unit in the PIM-enabled memory device. The PIM unit includes a PIM ALU and a PIM register file for executing the PIM instructions of the PIM task. The memory controller of the computing device issues the PIM instructions over a memory channel to the remote PIM-enabled memory device that includes the PIM unit, and the PIM unit executes the PIM instructions within the PIM-enabled memory device. For example, the PIM-enabled memory device may include a PIM unit (e.g., ALU, registers, and control logic) coupled to a memory array (e.g., in a memory bank).
The example method of
Continuing the above example of a PIM-enabled memory device, the remote execution resources shared by the first offload task and the second offload task are PIM unit resources, including the PIM ALU and PIM register file. By supporting the preemption of PIM tasks in the offload task scheduler, the resources of the PIM unit in the PIM-enabled memory device may be virtualized. A context switch from a first PIM task to a second PIM task, initiated by the offload task scheduler, causes the storing of the register state of the PIM register file for the first PIM task and the initialization of the PIM register file for executing the second PIM task.
For further explanation,
In the method of
Continuing the above example of a PIM-enabled memory device, PIM tasks are generated by processes/threads executing on processor cores of the host processor system and transmitted to the memory controller. PIM tasks are concurrently scheduled in PIM queues by the task scheduler for execution on the PIM-enabled memory device. Each concurrently scheduled PIM task in the PIM queues is allotted an amount of time for executing its PIM instructions before being preempted to allow a PIM task in a different PIM queue to execute. Each PIM task includes a stream of instructions/commands that are issued to the remote PIM-enabled device from the memory controller. This stream is interrupted at the expiration of the interval so that a new stream corresponding to a second PIM task is allowed to issue for its allotted interval.
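For illustration, the following C++ sketch models interrupting one task's instruction stream when its interval expires. The per-instruction deadline check and the string-based instruction stand-ins are assumptions of the sketch; in practice the interruption is effected by the task scheduler's scheduling logic in the memory controller.

```cpp
#include <chrono>
#include <deque>
#include <iostream>
#include <string>

using Clock = std::chrono::steady_clock;

// Issue instructions from one PIM task's stream until the stream is
// exhausted or the allotted interval expires. Returns true if the task
// ran to completion, false if it was preempted mid-stream.
bool issueStream(std::deque<std::string>& stream,
                 std::chrono::microseconds interval) {
    auto deadline = Clock::now() + interval;
    while (!stream.empty()) {
        if (Clock::now() >= deadline)
            return false;                       // interval expired: preempt
        std::cout << "issue: " << stream.front() << "\n";
        stream.pop_front();                     // instruction issued
    }
    return true;
}

int main() {
    std::deque<std::string> taskA = {"load r0", "load r1",
                                     "add r2, r0, r1", "store r2"};
    bool done = issueStream(taskA, std::chrono::microseconds(100));
    std::cout << (done ? "task completed" : "task preempted") << "\n";
}
```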
For further explanation,
In one example, the remote execution device is implemented in a memory die of a memory device and the context storage buffer is located on the memory die with the remote execution device. For example, a stacked memory device (e.g., an HBM) includes multiple memory core dies and a base logic die. In this example, the context storage buffer is located on the same die or within the same memory bank as the remote execution device (e.g., a PIM unit). In another example, the remote execution device is implemented in a memory die of a memory device and the context storage buffer is located on a die that is separate from the memory die that includes the remote execution device. For example, a stacked memory device (e.g., an HBM) includes multiple memory core dies and a base logic die. In this example, the context storage buffer is located on the base logic die and the remote execution device (e.g., a PIM unit) is implemented on a memory core die.
Continuing the above example of a PIM-enabled memory device, the PIM-enabled memory device may be an HBM stacked memory device that includes PIM-enabled memory banks. In this example, each PIM-enabled memory bank includes a memory array and a PIM unit for executing PIM computations coupled to the memory array. In one implementation, the memory array includes a reserved area for storing context data. In this implementation, in response to a context switch initiated by the offload task scheduler, the state of the register file is stored in the reserved area of the memory array in a context buffer corresponding to the VMID of the offload task. Context storage buffers for each VMID are included in the reserved area of the memory array. In another implementation, the base logic die includes a context storage area for storing context data. In this implementation, in response to a context switch initiated by the offload task scheduler, the state of the register file is stored in the context storage area of the base logic die in a context buffer corresponding to the VMID of the offload task. Context storage buffers for each VMID are included in the context storage area of the base logic die.
For further explanation,
Continuing the above example of a PIM-enabled memory device, a context storage buffer is provided for each process, thread, or VMID executing on the host processor system. When one PIM task is preempted by the task scheduling logic, the register state of the PIM register file in the PIM unit is saved to the context storage buffer. When the task scheduling logic subsequently returns to the preempted PIM task, the context data (i.e., the stored register state) is loaded into the PIM register file from the context storage buffer, thus restoring the state of the PIM task and allowing its execution to continue.
In view of the above disclosure, readers will appreciate that embodiments of the present disclosure support the virtualization of resources in a remote execution device. Where the complexity of processing units in the remote execution device is far lower than that of a host processing system (as with, e.g., a PIM-enabled memory device), support for the virtualization of processing resources in the remote execution device is achieved by a task scheduler in the host computing system that manages execution of tasks on the remote execution device and provides context switching for those tasks. Context storage buffers on the remote execution device facilitate the context switching orchestrated by the task scheduler. In this way, context switching on the remote execution device is supported without implementing task scheduling logic on the remote execution device and without tracking the register state of the remote execution device in the host processing system. Such advantages are particularly borne out in PIM devices, where saved contexts may be quickly loaded from context storage buffers in the memory associated with the PIM device to facilitate switching execution from one PIM task to another. Accordingly, serial execution of PIM tasks and the starvation of processes waiting on PIM resources may be avoided.
Embodiments can be a system, an apparatus, a method, and/or logic circuitry. Computer readable program instructions in the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and logic circuitry according to some embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by logic circuitry.
The logic circuitry may be implemented in a processor, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the processor, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and logic circuitry according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which includes one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While the present disclosure has been particularly shown and described with reference to embodiments thereof, it will be understood that various changes in form and details may be made therein without departing from the spirit and scope of the following claims. Therefore, the embodiments described herein should be considered in a descriptive sense only and not for purposes of limitation. The present disclosure is defined not by the detailed description but by the appended claims, and all differences within the scope will be construed as being included in the present disclosure.