Computing systems often include a number of processing resources, such as processors or processor cores, which can retrieve instructions, execute instructions, and store the results of executed instructions to memory. A processing resource can include a number of functional units such as arithmetic logic units (ALUs), floating point units (FPUs), and combinatorial logic blocks, among others. Typically, such functional units are local to the processing resources. That is, functional units tend to be implemented as part of a processor and are separate from memory devices in which data to be operated upon is retrieved and data forming the results of operations is stored. Such data can be accessed via a bus between the processing resources and memory.
Processing performance can be improved by offloading operations that would normally be executed in the functional units to a processing-in-memory (PIM) device. PIM refers to an integration of compute and memory for execution of instructions that would otherwise be executed by a computer system's primary processor or processor cores. In some implementations, PIM devices incorporate both memory and functional units in a single component or chip. Although PIM is often implemented as processing that is incorporated ‘in’ memory, this specification does not limit PIM so. Instead, PIM may also include so-called processing-near-memory implementations and other accelerator architectures. That is, the term ‘PIM’ as used in this specification refers to any integration—whether in a single chip or separate chips—of compute and memory for execution of instructions that would otherwise be executed by a computer system's primary processor or processor core. In this way, instructions executed in a PIM architecture are executed ‘closer’ to the memory accessed in executing the instruction. A PIM device can therefore save time by reducing or eliminating external communications and can also conserve power that would otherwise be necessary to process memory communications between the processor and the memory.
As mentioned above, PIM architectures support operations to be performed in, at, or near the memory module storing the data on which the operations are performed. Such an architecture allows for improved computational efficiency through reduced data transfer as well as reduced power consumption. In some implementations, a PIM architecture supports offloading instructions from a host processor for execution in memory or near memory, such that bandwidth on the data link between the processor and the memory is conserved and power consumption of the processor is reduced. The execution of PIM instructions by a PIM device does not require loading data into local CPU/GPU registers and writing data from local CPU/GPU storage back to the memory. In fact, any processing element that is coupled to memory for execution of operations can benefit from PIM device execution.
Such a host processor often supports multi-processing, where multiple processes of the same or different applications are executed in parallel. In such a multi-processing environment, however, without protection, two or more processes can simultaneously access a shared PIM resource in a manner that results in functional incorrectness or a security vulnerability. Concurrent access can result in functional incorrectness when, for example, two processes access the same PIM register. For example, assume process “A” loaded instructions into a PIM device's local instruction store. During process A's PIM execution, suppose another process, such as process “B,” modifies this local instruction store. Process A's PIM code is then corrupted, and process A's PIM execution will return incorrect results. Similarly, process B can also access PIM registers by sending PIM memory operations and can corrupt the PIM register state as well, resulting in incorrect PIM phase execution of process A.
Additionally, such simultaneous access can also result in security vulnerabilities. For example, one process can create a side channel via PIM registers to another process's data without that process's knowledge. For instance, if process B is malicious, process B can create a side channel via PIM registers by sending PIM memory operations that leak PIM register information of process A into process B's own address space.
Accordingly, implementations in accordance with the present disclosure provide software and hardware support and resource management techniques for controlling access to a PIM device through the use of virtualization. For explanation, in the description below, a “PIM offload instruction” is executed by a processor core, a “PIM command” is generated and issued to a PIM device as a result of executing the PIM offload instruction, and a “PIM instruction” is executed by the PIM device. Implementations in accordance with the present disclosure prevent corruption of PIM configuration space, including a local instruction store (LIS) that stores PIM instructions for execution, PIM configuration registers, and the like. PIM orchestration operations are isolated by restricting orchestration of a PIM device or set of PIM device resources to only one process at a time. It should also be noted that a PIM device has two distinct spaces: 1) a PIM configuration space used for configuring the PIM device before a PIM operation, and 2) a PIM orchestration space used to orchestrate execution of PIM operations. That is, the LIS component stores the PIM instructions that will be executed on the PIM device.
In one aspect, a PIM device can also be referred to as a PIM unit, and the terms “device” and “unit” are used interchangeably. In one aspect, as used herein, “orchestrate” refers to planning, coordinating, configuring, and managing each operation related to a PIM device. While examples in this disclosure discuss the applicability of the implementations to PIM technology, such examples should not be construed as limiting. Readers of skill in the art will appreciate that the implementations disclosed herein are applicable to virtual partitioning and orchestration of other integrations of compute and memory, such as processing-near-memory implementations and other accelerator architectures.
To that end, various apparatus, agents, and methods are disclosed in this specification for process isolation for a PIM device using virtualization. An implementation is directed to an apparatus configured for such process isolation. The apparatus includes one or more processing cores and computer memory. The computer memory comprises computer program instructions that, when executed, cause the apparatus to carry out: receiving, from a process, a call to allocate a virtual address space, where the process stores a PIM configuration context in the virtual address space; allocating the virtual address space to the process including mapping a physical address space including configuration registers of the PIM device to the virtual address space only if the physical address space is not mapped to another process's virtual address space; and programming the PIM device configuration space according to the configuration context.
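The allocation flow recited above can be illustrated with a simplified sketch. This is a toy model under stated assumptions, not the actual driver: the class and method names are hypothetical, and a real driver would manipulate page tables rather than Python dictionaries.

```python
class PIMDriver:
    """Toy model of the allocation flow: a physical configuration-register
    space is mapped into a process's virtual address space only if it is
    not already mapped for another process."""

    def __init__(self):
        self.owner_by_phys = {}  # physical config space id -> owning process id

    def allocate_config_space(self, pid, phys_space_id):
        # Exclusivity check: refuse the mapping if another process holds it.
        current = self.owner_by_phys.get(phys_space_id)
        if current is not None and current != pid:
            return None
        self.owner_by_phys[phys_space_id] = pid
        # Return a hypothetical virtual base address for the new mapping.
        return 0x7F00_0000 + phys_space_id * 0x1000
```

With this model, a first process's allocation succeeds, while a second process's request for the same physical configuration space is refused until it asks for a different, unmapped space.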
In an implementation, the configuration context includes a virtual instruction store comprising a plurality of entries, with each entry including a PIM instruction opcode, and the virtual instruction store is mapped to a local instruction store utilized by the PIM device. In such an implementation, the apparatus also includes computer program instructions that, when executed, cause the apparatus to carry out: responsive to receiving a PIM command that includes a virtual address indexing an entry in the virtual instruction store, translating the virtual address to a physical address indexing an entry of the local instruction store; and retrieving, from the entry of the local instruction store, the PIM instruction opcode.
In an implementation, the apparatus includes a translation mechanism that determines whether there is a valid mapping of the virtual address and translates the virtual address only if a valid mapping exists. In another implementation, the apparatus also includes a PIM agent that includes the translation mechanism.
In an implementation of the apparatus, allocating the virtual address space to the process includes mapping memory buffers on which PIM commands operate to one or more memory pages assigned to the process. In such an implementation, the apparatus includes computer program instructions that, when executed, cause the apparatus to carry out: receiving, from the process, a PIM command targeting a virtual memory address of a memory buffer; and translating the virtual memory address to a physical address of one of the memory buffers only if the physical address is included in one of the memory pages assigned to the process.
In an implementation, the apparatus also includes computer program instructions that, when executed, cause the apparatus to carry out, while the virtual address space is allocated to the process: receiving, from a second process, a call to allocate a second virtual address space, where the second process stores a second PIM configuration context in the second virtual address space; allocating the second virtual address space to the second process, including mapping a different physical address space that includes different PIM device configuration registers to the second virtual address space only if the different physical address space is not mapped to another process's virtual address space; and programming the different PIM device configuration registers according to the second configuration context. In such an implementation, allocating the second virtual address space to the second process includes mapping different memory buffers on which PIM instructions operate to one or more different memory pages assigned to the second process. Additionally, the apparatus includes computer program instructions that, when executed, cause the apparatus to carry out: receiving, from the second process, a second PIM command targeting a second virtual memory address of a different memory buffer; and translating the second virtual memory address to a physical address of one of the different memory buffers only if the physical address is included in one of the memory pages assigned to the second process.
In an implementation, the computer program instructions can be a driver, an operating system, or a hypervisor. In some implementations, the PIM device is a component of the apparatus. In other implementations, the PIM device is a component that is separate from the apparatus.
A method of process isolation for a PIM device using virtualization is disclosed in this specification. In an implementation, the method includes receiving, from a process, a call to allocate a virtual address space. The process stores a PIM configuration context in the virtual address space. The method also includes allocating the virtual address space to the process, including mapping a physical address space that comprises configuration registers of the PIM device to the virtual address space. Such a mapping is carried out only if the physical address space is not mapped to another process's virtual address space. The method also includes programming the PIM device configuration space according to the configuration context.
In an implementation, the configuration context includes a virtual instruction store comprising a plurality of entries, with each entry including a PIM instruction opcode, and the virtual instruction store is mapped to a local instruction store utilized by the PIM device. In such an implementation, the method also includes: responsive to receiving a PIM instruction that includes a virtual address indexing an entry in the virtual instruction store, translating the virtual address to a physical address indexing an entry of the local instruction store; and retrieving, from the entry of the local instruction store, the opcode of the PIM instruction.
In an implementation of the method, allocating the virtual address space to the process includes mapping memory buffers on which PIM instructions operate to one or more memory pages assigned to the process, and the method also includes: receiving, from the process, a PIM instruction targeting a virtual memory address of a memory buffer; and translating the virtual memory address to a physical address of one of the memory buffers only if the physical address is included in one of the memory pages assigned to the process.
In an implementation, while the virtual address space is allocated to the process, the method includes receiving, from a second process, a call to allocate a second virtual address space, where the second process stores a second PIM configuration context in the second virtual address space; allocating the second virtual address space to the second process, including mapping a different physical address space that comprises different PIM device configuration registers to the second virtual address space only if the different physical address space is not mapped to another process's virtual address space; and programming the different PIM device configuration registers according to the second configuration context. In such an implementation, allocating the second virtual address space to the second process includes mapping different memory buffers on which PIM instructions operate to one or more different memory pages assigned to the second process, and the method includes: receiving, from the second process, a second PIM instruction targeting a second virtual memory address of a different memory buffer; and translating the second virtual memory address to a physical address of one of the different memory buffers only if the physical address is included in one of the memory pages assigned to the second process.
Implementations in accordance with the present disclosure will be described in further detail beginning with
The example system 100 of
The GPU is a graphics and video rendering device for computers, workstations, game consoles, and similar digital processing devices. A GPU is generally implemented as a co-processor component to the CPU of a computer. The GPU can be discrete or integrated. For example, the GPU can be provided in the form of an add-in card (e.g., video card), stand-alone co-processor, or as functionality that is integrated directly into the motherboard of the computer or into other devices.
The phrase accelerated processing unit (“APU”) is considered to be a broad expression. The term ‘APU’ refers to any cooperating collection of hardware and/or software that performs those functions and computations associated with accelerating graphics processing tasks, data parallel tasks, or nested data parallel tasks in an accelerated manner compared to conventional CPUs, conventional GPUs, software, and/or combinations thereof. For example, an APU is a processing unit (e.g., processing chip/device) that can function both as a central processing unit (“CPU”) and a graphics processing unit (“GPU”). An APU can be a chip that includes additional processing capabilities used to accelerate one or more types of computations outside of a general-purpose CPU. In one implementation, an APU can include a general-purpose CPU integrated on a same die with a GPU, an FPGA, machine learning processors, digital signal processors (DSPs), audio/sound processors, or other processing units, thus improving data transfer rates between these units while reducing power consumption. In some implementations, an APU can include video processing and other application-specific accelerators.
It should be noted that the terms processing in memory (PIM), processing near-memory (PNM), and processing in or near-memory (PINM) all refer to a device (or unit) that includes a non-transitory computer readable memory device, such as dynamic random access memory (DRAM), and one or more processing elements. The memory and processing elements can be located on the same chip, within the same package, or can otherwise be tightly coupled. For example, a PNM device could include a stacked memory having several memory layers stacked on a base die, where the base die includes a processing device that provides near-memory processing capabilities.
The host device 130 of
In an implementation, the processor cores 102, 104, 106, 108 operate according to an extended instruction set architecture (ISA) that includes explicit support for PIM offload instructions that are offloaded to a PIM device for execution. Examples of PIM offload instructions include a PIM_Load and a PIM_Store instruction, among others. In another implementation, the processor cores operate according to an ISA that does not expressly include support for PIM offload instructions. In such an implementation, a PIM driver, hypervisor, or operating system provides an ability for a process to allocate a virtual memory address range (an aperture) that is utilized exclusively for PIM offload instructions. An instruction referencing a location within the aperture is identified as a PIM offload instruction.
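The aperture-based identification described above amounts to a simple address-range test. The following sketch assumes hypothetical aperture bounds purely for illustration:

```python
# Hypothetical virtual address range reserved for PIM offload instructions.
PIM_APERTURE_BASE = 0x2000_0000
PIM_APERTURE_SIZE = 0x0010_0000

def is_pim_offload(addr: int) -> bool:
    """An instruction referencing an address inside the aperture is
    treated as a PIM offload instruction; all other addresses are
    handled as ordinary memory accesses."""
    return PIM_APERTURE_BASE <= addr < PIM_APERTURE_BASE + PIM_APERTURE_SIZE
```

Because the aperture is allocated by the driver, hypervisor, or operating system, no ISA extension is needed; the range check alone distinguishes PIM offload instructions from conventional loads and stores.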
In the implementation in which the processor cores operate according to an extended ISA that explicitly supports PIM instructions, a PIM instruction is completed by the processor cores 102, 104, 106, 108 when virtual and physical memory addresses associated with the PIM instruction are generated, operand values in processor registers become available, and memory consistency checks have completed. The operation (e.g., load, store, add, multiply) indicated in the PIM instruction is not executed on the processor core and is instead offloaded for execution on the PIM device. Once the PIM offload instruction is complete in the processor core, the processor core issues a PIM command, operand values, memory addresses, and other metadata to the PIM device. In this way, the workload on the processor cores 102, 104, 106, 108 is alleviated by offloading an operation for execution on a device external to or remote from the processor cores 102, 104, 106, 108.
The memory addresses of the PIM command refer to, among other things, an entry in a local instruction store that stores a PIM instruction that is to be executed by at least one PIM device 181. In the example of
A PIM instruction can move data between the registers and memory, and it can also trigger computation on this data in the ALU 116. In some examples, the execution unit also includes a local instruction store (LIS) 122 that stores commands of PIM instructions written into the LIS by the host processor 132. In these examples, the PIM instructions include a pointer to an index in the LIS 122 that includes the operations to be executed in response to receiving the PIM instruction. For example, the LIS 122 holds the actual opcodes and operands of each PIM instruction.
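As a rough illustration of the LIS described above, it can be modeled as an indexed table of opcode/operand entries, where a PIM command carries only an index into the table rather than a full instruction. The entry layout here is an assumption for illustration, not the actual hardware format:

```python
from dataclasses import dataclass

@dataclass
class LISEntry:
    opcode: str      # e.g. "ADD", "LOAD", "STORE" (illustrative mnemonics)
    operands: tuple  # e.g. architected PIM register indices

class LocalInstructionStore:
    def __init__(self, num_entries: int = 8):
        self.entries = [None] * num_entries

    def write(self, index: int, entry: LISEntry):
        # The host processor writes the PIM instruction into the LIS
        # ahead of execution.
        self.write_checked(index, entry)

    def write_checked(self, index: int, entry: LISEntry):
        self.entries[index] = entry

    def fetch(self, index: int):
        # A PIM command carries only this index, not the full opcode,
        # which keeps the host-memory command interface narrow.
        return self.entries[index]
```

Because each PIM command names only an LIS index, the host-memory interface does not need additional command pins for every possible PIM opcode.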
The execution unit 150 is a PIM device 181 that is included in a PIM-enabled memory device 180 (e.g., a remote memory device) having one or more DRAM arrays. In such an implementation, PIM instructions direct the PIM device 181 to execute an operation on data stored in the PIM-enabled memory device 180. For example, operators of PIM instructions include load, store, and arithmetic operators, and operands of PIM instructions can include architected PIM registers, memory addresses, and values from core registers or other core-computed values. The ISA can define the set of architected PIM registers (e.g., eight indexed registers).
In some examples, there is one execution unit per DRAM component (e.g., bank, channel, chip, rank, module, die, etc.), and thus the PIM-enabled memory device 180 includes multiple execution units 150 that are PIM devices. PIM commands issued from the processor cores 102, 104, 106, 108 can access data from DRAM by opening/closing rows and reading/writing columns (like conventional DRAM commands do). In some implementations, the host processor 132 issues PIM commands to the ALU 116 of each execution unit 150. In implementations with a LIS 122, the host processor 132 issues commands that include an index into a line of the LIS holding the PIM instruction to be executed by the ALU 116. In these implementations with a LIS 122, the host-memory interface does not require modification with additional command pins to cover all the possible opcodes needed for PIM. Each PIM command carries a target address that is used to direct it to the appropriate PIM unit(s) as well as the PIM instruction to be performed. An execution unit 150 can operate on a distinct subset of the physical address space. When a PIM command reaches the execution unit 150, it is serialized with other PIM commands and memory accesses to DRAM targeting the same subset of the physical address space.
The execution unit 150 is characterized by faster access to data relative to the host processor 132. The execution unit 150 operates at the direction of the processor cores 102, 104, 106, 108 to execute memory intensive tasks. In the example of
The host device 130 also includes at least one memory controller 140 that is shared by the processor cores 102, 104, 106, 108 for accessing a channel of the PIM-enabled memory device 180. In some implementations, the host device 130 can include multiple memory controllers, each corresponding to a different memory channel in the PIM-enabled memory device 180. In some examples, the memory controller 140 is also used by the processor cores 102, 104, 106, 108 for executing one or more processes 172, 174, 176, and 178 and offloading PIM instructions for execution by the execution unit 150.
The memory controller 140 maintains one or more dispatch queues for queuing commands to be dispatched to a memory channel or other memory partition. Stored in memory and executed by the processor cores 102, 104, 106, 108 is an operating system 125 and a PIM driver 124.
Stored in the memory array 182 is an operating system 125 and a PIM driver 124. The OS 125 and PIM driver are executed by any of the processor cores of the host processor 132. In an implementation, the PIM driver 124 provides virtualization of PIM configuration and orchestration resources in order to provide isolation between executing processes 172, 174, 176, 178 according to implementations of the present disclosure.
In the example of
The configuration context stored by the process 172 in virtual memory and then later programmed into the configuration space of the PIM device 181 includes a virtual instruction store. The virtual instruction store includes a set of entries, with each entry storing a PIM instruction opcode. The virtual instruction store is mapped to the LIS that is utilized by the PIM device. Readers should note that the LIS 122 is said to be ‘utilized’ by the PIM device because in some implementations the LIS is implemented as part of the PIM device, while in other implementations the LIS is implemented as a component separate from, but near to, the PIM device. The PIM driver 124 programs a translation mechanism (such as a TLB or page table entry) with the mappings of virtual instruction store entries to physical LIS entries. When a PIM command that includes a virtual address indexing an entry in the virtual instruction store is executed, the translation mechanism determines whether a mapping is valid and, if so, performs the translation of the virtual address to a physical address indexing an entry of the LIS 155. The opcode of the PIM instruction can then be retrieved from the entry in the LIS for execution by the PIM device 181. If there is no valid mapping—meaning the virtual address included with the PIM command is not mapped to a physical address in the translation mechanism—the translation fails. In this way, isolation of the PIM resources is enforced by the translation mechanism. For example, if a malicious process were to try to access a PIM instruction at a particular address, the translation mechanism will effectively enforce isolation because that malicious process will not have an address with a valid translation.
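A minimal sketch of this translation step follows, assuming a per-process mapping table programmed by the driver. The class and method names are hypothetical; a raised exception models the translation fault that enforces isolation:

```python
class InstructionStoreTranslator:
    """Maps (process id, virtual LIS index) to a physical LIS entry.
    A missing mapping models the translation fault that keeps one
    process from reaching another process's LIS entries."""

    def __init__(self):
        self.mappings = {}  # (pid, virtual_index) -> physical_index

    def map_entry(self, pid, virt_idx, phys_idx):
        # Programmed by the driver when the virtual instruction store
        # is mapped to the physical LIS.
        self.mappings[(pid, virt_idx)] = phys_idx

    def translate(self, pid, virt_idx):
        phys = self.mappings.get((pid, virt_idx))
        if phys is None:
            # No valid mapping: the translation fails, so the command
            # never reaches the LIS entry.
            raise PermissionError("no valid LIS mapping for this process")
        return phys
```

In this model, a malicious process simply has no key in the table for another process's entries, so its lookup faults rather than leaking an LIS entry.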
In an implementation, the PIM driver, when allocating the virtual memory address space to a requesting process 172, maps memory buffers on which PIM instructions operate to one or more memory pages assigned to the process. The PIM driver 124 then receives, from the process 172, a PIM instruction targeting a virtual memory address of the memory buffer and translates the virtual memory address to a physical address of one of the memory buffers only if the physical address is included in one of the memory pages assigned to the process. That is, operands of PIM instructions are referenced by a process through a virtual memory address, and if the virtual memory address is mapped to a physical memory page that is assigned to the process, the translation of the virtual memory address to a physical address can be carried out. If the process is not assigned to the physical memory page, the translation is not carried out and the process cannot access the physical memory buffer of the PIM device that stores the operand upon which the process's PIM instruction is attempting to operate. In this way, one process cannot access data that is to be used by another process's PIM instruction. This, in turn, removes the risk of functional incorrectness through multiprocess conflicts.
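The page-ownership check described here can be sketched as follows. The page size, ownership records, and names are assumptions for illustration; a real implementation would consult page table attributes:

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

class BufferTranslator:
    def __init__(self):
        self.page_owner = {}  # physical page number -> owning process id
        self.va_to_pa = {}    # (pid, virtual addr) -> physical addr

    def assign_page(self, pid, phys_page):
        # The driver assigns a physical memory page to a process when
        # mapping that process's PIM memory buffers.
        self.page_owner[phys_page] = pid

    def map_buffer(self, pid, va, pa):
        self.va_to_pa[(pid, va)] = pa

    def translate(self, pid, va):
        pa = self.va_to_pa.get((pid, va))
        if pa is None or self.page_owner.get(pa // PAGE_SIZE) != pid:
            # The physical page is not assigned to this process, so the
            # translation fails and the operand buffer stays inaccessible.
            return None
        return pa
```

The ownership test is what prevents one process from operating on a buffer that holds another process's PIM operands.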
Readers will recognize that while a PIM driver 124 is described here as carrying out the virtualization of the PIM resources to ensure process isolation, other components and modules can perform such virtualization. Any one or combination of the PIM driver 124, the operating system 125, and a hypervisor (not shown here) can perform such virtualization. Enforcing the virtualization is generally carried out by a translation mechanism in the processor cores or host processor, or by a standalone component in the form of a PIM agent.
In an implementation, the apparatus of
In some implementations, a PIM agent 160 is responsible for PIM configuration and/or orchestration of the execution unit 150. The PIM agent 160 can be the processor cores 102, 104, 106, 108. In some implementations, the PIM agent 160 can be a dedicated processing element such as, for example, a platform security processor or direct memory controller (“DMA”) microcontroller, or any other system agent of the system 100. By way of illustration only,
The application or software code on the PIM agent 160 runs in the user address space. The user application will make an API call provided by the PIM runtime or PIM library for configuring and orchestrating PIM. A processor core 102, 104, 106, 108 (or driver) can also offload the PIM configuration work to the PIM agent 160, in which case the PIM agent 160 runs in the driver's address space. Multiple users can launch orchestration work on the PIM device, and consequently the PIM agent 160 receives multiple orchestration commands. The PIM agent 160 concurrently launches work from different users. The virtualized LIS 155 provides the necessary isolation.
In some implementations, the PIM agent 160 manages PIM resources at the host level before reaching any DRAM channels of the PIM-enabled memory device 180 that host PIM unit to which the thread can dispatch work. That is, the PIM agent 160, using a translation mechanism 145 and the LIS component 155, determines whether a process has a valid virtual-to-physical mapping with the execution unit 150. If there is a valid virtual-to-physical mapping, the translation mechanism 145 provides the physical address of the translation and the PIM agent issues a PIM command to the execution unit 150 with the physical address. If there is not a valid virtual-to-physical mapping, the translation mechanism faults and the process cannot access the PIM resources of the PIM device 181.
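The gatekeeping role of the PIM agent can be sketched end to end as follows. This is a hypothetical model of the interaction between the translation mechanism 145 and command dispatch, not the actual agent implementation:

```python
class PIMAgent:
    """Gatekeeper sketch: a PIM command is issued to the execution unit
    only when the requesting process has a valid virtual-to-physical
    mapping; otherwise the translation faults and nothing is issued."""

    def __init__(self, mappings):
        self.mappings = mappings  # (pid, virtual addr) -> physical addr
        self.issued = []          # commands forwarded to the PIM device

    def dispatch(self, pid, virt_addr, command):
        phys = self.mappings.get((pid, virt_addr))
        if phys is None:
            return False          # translation fault: access denied
        self.issued.append((command, phys))
        return True
```

Because the check happens at the host level, an unauthorized command is stopped before it ever reaches a DRAM channel of the PIM-enabled memory device.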
In the example of
In the flow chart of
The flow chart of
If the physical address of the configuration space of the PIM device is not currently mapped to any other process, then the PIM driver can allocate 404 the virtual address space by mapping 408 memory buffers on which PIM instructions operate to one or more memory pages assigned to the process.
The flow chart of
The flow chart of
The PIM device then utilizes the translated address (the physical address) to retrieve 416, from the entry of the local instruction store, a PIM instruction opcode. The opcode specifies an operation to be performed by the PIM device.
The flow chart also includes receiving 418, from the process, a PIM command targeting a virtual memory address of a memory buffer of the PIM device and translating 420 the virtual memory address to a physical address of one of the memory buffers only if the physical address is included in one of the memory pages assigned to the process. The memory buffer stores the operand of the instruction. A translation mechanism translates 420 the virtual memory address to a physical address only if the requesting process has valid mappings. The translation mechanism inspects memory page attributes to determine whether the memory page holding the virtual memory address is assigned to the process ID of the requesting process. If it is, the translation can proceed. If the memory page is assigned to another process ID, then translation fails. In this way, one process cannot access PIM memory buffers ‘owned’ or utilized by another process.
The method of
The method of
The allocation 504 of the second virtual address space also includes mapping different memory buffers on which PIM instructions operate to one or more different memory pages assigned to the second process. The PIM driver then receives, from the second process, a second PIM instruction (different from the first PIM instruction of the first requesting process) targeting a second virtual memory address of a different memory buffer. The PIM driver translates the second virtual memory address to a physical address of one of the different memory buffers only if the physical address is included in one of the memory pages assigned to the second process.
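The concurrent isolation of the two processes described above can be summarized in a small allocation sketch: both processes succeed only when their physical configuration spaces are disjoint, and an overlapping request is refused. The identifiers are hypothetical:

```python
def allocate(owners, pid, phys_space):
    """Map phys_space to pid only if no other process already owns it.
    `owners` records phys_space -> owning process id."""
    if owners.get(phys_space, pid) != pid:
        return False   # already mapped to another process's address space
    owners[phys_space] = pid
    return True
```

Under this rule, a first and a second process can each hold a different physical configuration space at the same time, while neither can claim the other's.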
In this way, the module providing process isolation effectively assigns groups of PIM resources to different processes and ensures that no other process can utilize those resources while they are assigned. Security and functional correctness of the data are thus ensured.
Implementations can be a system, an apparatus, a method, and/or logic. Computer readable program instructions in the present disclosure can be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. In some implementations, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) can execute the computer readable program instructions by utilizing state information of the computer readable program instructions.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and logic circuitry according to some implementations of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by logic circuitry.
The logic circuitry can be implemented in a processor, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the processor, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and logic circuitry according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of instructions, which includes one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block can occur out of the order noted in the figures. For example, two blocks shown in succession can, in fact, be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While the present disclosure has been particularly shown and described with reference to implementations thereof, it will be understood that various changes in form and details can be made therein without departing from the spirit and scope of the following claims. Therefore, the implementations described herein should be considered in a descriptive sense only and not for purposes of limitation. The present disclosure is defined not by the detailed description but by the appended claims, and all differences within the scope will be construed as being included in the present disclosure.