This relates to integrated circuits and, more particularly, to programmable integrated circuits in a hardware acceleration platform.
Programmable integrated circuits are a type of integrated circuit that can be programmed by a user to implement a desired custom logic function. In a typical scenario, a logic designer uses computer-aided design tools to design a custom logic circuit. When the design process is complete, the computer-aided design tools generate configuration data. The configuration data is loaded into memory elements on a programmable integrated circuit to configure the device to perform the functions of the custom logic circuit.
Programmable devices may be used for co-processing in big-data or fast-data applications. For example, programmable devices may be used in application acceleration tasks in a data center and may be reprogrammed during data center operation to perform different tasks. By offloading computationally intensive tasks from a host processor to highly-parallel acceleration resources on a programmable device (sometimes referred to as a co-processor or an acceleration processor), the host processor is freed up to perform other critical processing tasks. The use of programmable devices as hardware accelerators can therefore help deliver improved speeds, latency, power efficiency, and flexibility for end-to-end cloud computing, networking, storage, artificial intelligence, autonomous driving, virtual reality, augmented reality, gaming, and other data-centric applications.
With the substantial increase in the computing power of a single server in cloud data centers in recent years, certain workloads can significantly underutilize the available processing power of the server. This underutilization of available processing capability has spawned the adoption of server virtualization. Server virtualization allows multiple virtual machines (VMs) and containers to run as independent virtual servers on a single host processor/server. In a hardware acceleration platform, a programmable integrated circuit such as a field-programmable gate array (FPGA) device connected to the host processor is physically partitioned into independent regions, each of which is assigned to and serves a corresponding one of the virtual machines or containers. Physically partitioning programmable resources or time multiplexing of physically partitioned resources in this way enables the FPGA device to be used simultaneously by a limited number of virtual machines.
It is within this context that the embodiments described herein arise.
The present embodiments relate to methods and apparatus for bridging static device virtualization technologies such as single-root input/output virtualization (SR-IOV) or scalable input/output virtualization (Scalable IOV) to dynamically-reconfigurable programmable integrated circuits. A server may include a host processor that is accelerated using a programmable device such as a programmable logic device (PLD) or a field-programmable gate array (FPGA) device.
The programmable device may include a programmable resource interface circuit (PIC) responsible for programmable resource management functions and may also include one or more accelerator function units (AFUs) that can be dynamically reprogrammed to perform one or more tasks offloaded from the host processor. Each AFU may be further logically partitioned into virtualized AFU contexts (AFCs). The PIC may be a static region for mapping a particular PCIe (Peripheral Component Interconnect Express) function such as a physical function or a virtual function to one or more AFCs in the programmable device via a virtualized communications interface. The host processor (e.g., host drivers) handles hardware discovery/enumeration and loading of device drivers such that security isolation and interface performance are maintained.
For example, the PIC may include mechanisms that enable the host processor to perform dynamic request/response routing with access isolation (e.g., for handling upstream memory requests from an AFC to the host processor, downstream memory requests from the host processor to an AFC, AFC interrupts, function level resets for an AFC, etc.) and to perform AFC discovery and identification (e.g., using a device feature list with programmable or virtualized device feature headers, updating an AFU/AFC identifier using a received AFU metadata file, etc.). Configured in this way, the virtualized AFCs on a single programmable device can be efficiently scaled to support and interface with more than four virtual machines (VMs) running on the host processor/server, more than 10 VMs, more than 20 VMs, more than 30 VMs, 10-100 VMs, more than 100 VMs, or hundreds/thousands of VMs, process containers, or host processes running on the host server. By using the PIC to create dynamic associations between the host/guest OS and the AFCs via available platform capabilities, the particular platform virtualization capabilities (e.g., SR-IOV or Scalable IOV) may be abstracted from the AFU developer so that the AFU developer sees a consistent interface across platforms.
It will be recognized by one skilled in the art that the present exemplary embodiments may be practiced without some or all of these specific details. In other instances, well-known operations have not been described in detail in order not to unnecessarily obscure the present embodiments.
An illustrative embodiment of a programmable integrated circuit such as programmable logic device (PLD) 100 that may be configured to implement a circuit design is shown in
Programmable logic device 100 may contain programmable memory elements. Memory elements may be loaded with configuration data (also called programming data or configuration bitstream) using input-output elements (IOEs) 102. Once loaded, the memory elements each provide a corresponding static control signal that controls the operation of an associated functional block (e.g., LABs 110, DSP 120, RAM 130, or input-output elements 102).
In a typical scenario, the outputs of the loaded memory elements are applied to the gates of metal-oxide-semiconductor transistors in a functional block to turn certain transistors on or off and thereby configure the logic in the functional block including the routing paths. Programmable logic circuit elements that may be controlled in this way include parts of multiplexers (e.g., multiplexers used for forming routing paths in interconnect circuits), look-up tables, logic arrays, AND, OR, NAND, and NOR logic gates, pass gates, etc.
The memory elements may use any suitable volatile and/or non-volatile memory structures such as random-access-memory (RAM) cells, fuses, antifuses, programmable read-only memory cells, mask-programmed and laser-programmed structures, combinations of these structures, etc. Because the memory elements are loaded with configuration data during programming, the memory elements are sometimes referred to as configuration memory, configuration random-access memory (CRAM), or programmable memory elements. Programmable logic device (PLD) 100 may be configured to implement a custom circuit design. For example, the configuration RAM may be programmed such that LABs 110, DSP 120, RAM 130, programmable interconnect circuitry (i.e., vertical channels 140 and horizontal channels 150), and the input-output elements 102 form the circuit design implementation.
In addition, the programmable logic device may have input-output elements (IOEs) 102 for driving signals off of device 100 and for receiving signals from other devices. Input-output elements 102 may include parallel input-output circuitry, serial data transceiver circuitry, differential receiver and transmitter circuitry, or other circuitry used to connect one integrated circuit to another integrated circuit.
Device 100 may also include programmable interconnect circuitry in the form of vertical routing channels 140 (i.e., interconnects formed along a vertical axis of PLD 100) and horizontal routing channels 150 (i.e., interconnects formed along a horizontal axis of PLD 100), each routing channel including at least one track to route at least one wire. If desired, the interconnect circuitry may include pipeline elements, and the contents stored in these pipeline elements may be accessed during operation. For example, a programming circuit may provide read and write access to a pipeline element.
Note that other routing topologies, besides the topology of the interconnect circuitry depicted in
A host operating system (host OS) 204 is loaded on host CPU 202. Host operating system 204 may implement a hypervisor 216 that facilitates the use of one or more virtual machines (VMs) 210-1 on host processor 202. Virtual machines are self-contained virtualized partitions that simulate an independent hardware computing resource. Hypervisor 216 may be part of the software or firmware running on host processor 202 and may serve as a virtual machine monitor (sometimes also referred to as a virtual machine manager or VMM) that manages the system's hardware resources so they are distributed efficiently among the virtual machines (VMs) on server 200.
Each virtual machine may be referred to as a guest machine running its own guest operating system (OS). In certain virtualization architectures, a given virtual machine such as VM0 (a VM hosting the host OS 204) may serve as a parent partition, a control domain (“Domain 0”), or a root partition that manages machine-level functions such as device drivers, power management, and the addition/removal of VMs (e.g., VM0 has a control stack for managing virtual machine creation, destruction, and configuration). The control domain VM0 has special privileges, including the capability to access the hardware directly; it handles all access to the system's input-output functions and interacts with any remaining VMs (e.g., VM1, VM2, etc.). Each virtual machine 210-1 may be used to run one or more user applications. Hypervisor 216 presents the VM's guest OS with a virtual operating platform and manages the execution of the guest operating systems while sharing virtualized hardware resources. Hypervisor 216 may run directly on the host's hardware (as a type-1 bare metal hypervisor) or may run on top of an existing host operating system (as a type-2 hosted hypervisor). If desired, additional virtualization drivers and tools (not shown) may be used to help each guest virtual machine communicate more efficiently with the underlying physical hardware of host CPU 202 or the hardware acceleration resources provided by programmable device 100. In general, processor 202 may be configured to host at least two VMs, two to ten VMs, more than 10 VMs, 10-100 VMs, more than 100 VMs, or any suitable number of virtual machines.
In addition to hosting virtual machines 210-1, CPU 202 may also host one or more containers 210-2 (sometimes referred to as process containers). In contrast to a virtual machine 210-1 that has a full guest OS with its own memory management, device drivers, and daemons, containers 210-2 share the host OS 204 and are relatively lighter weight and more portable than VMs. Containers 210-2 are a product of “operating system virtualization” and allow running multiple isolated user-space instances on the same host kernel (i.e., the host kernel is shared among the different containers each running its own workload). This isolation guarantees that any process inside a particular container cannot see any process or resource outside that container.
Similar to what hypervisor 216 provides for VM 210-1, a container manager/engine 211 in the user space may be used to run one or more containers 210-2 on server 200. Container manager 211 presents the isolated container applications with a virtual operating platform and manages the execution of the isolated container applications while sharing virtualized hardware resources. In general, processor 202 may be configured to host at least two containers, two to ten containers, more than 10 containers, 10-100 containers, 100-1000 containers, thousands of containers, or any suitable number of containers sharing the same host OS kernel. As examples, host CPU 202 may be configured to host process containers such as Docker, Linux containers, Windows Server containers, and/or other types of process containers. In other suitable embodiments, host CPU 202 may be configured to host machine containers such as Hyper-V containers developed by Microsoft Corporation, Kata containers managed by the OpenStack Foundation, and/or other types of light-weight virtual machines.
Furthermore, CPU 202 may host one or more processes 210-3 (sometimes referred to as host or application processes), each of which is similar to a container 210-2 in the sense that a process 210-3 has its own address space, program, CPU state, and process table entry. In contrast to a container 210-2, which encapsulates one isolated program with all of its dependencies in its filesystem environment, a process 210-3 does not encapsulate any of the dependent shared libraries in the filesystem environment. Process 210-3 can be run on top of the OS. In general, processor 202 may be configured to host at least two processes 210-3, two to ten processes, more than 10 processes, 10-100 processes, 100-1000 processes, thousands of processes, or any suitable number of processes on the kernel.
In general, the software running on host CPU 202 may be implemented using software code stored on non-transitory computer readable storage media (e.g., tangible computer readable storage media). The software code may sometimes be referred to as software, data, program instructions, instructions, script, or code. The non-transitory computer readable storage media may include non-volatile memory such as non-volatile random-access memory (NVRAM), one or more hard drives (e.g., magnetic drives or solid state drives), one or more removable flash drives or other removable media, or the like. Software stored on the non-transitory computer readable storage media may be executed on the processing circuitry of host processor 202.
Host processor 202 may communicate with programmable device 100 via a host interface such as host interface 230. Host interface 230 may be a computer bus interface such as the PCIe (Peripheral Component Interconnect Express) interface developed by the PCI Special Interest Group or PCI-SIG, Cache Coherent Interconnect for Accelerators (CCIX), Gen-Z, Open Coherent Accelerator Processor Interface (OpenCAPI), Intel Accelerator Link (IAL), Nvidia's NVLink, or other computer bus interfaces. In general, host interface 230 may be implemented using other multiple data lane (e.g., at least 2 lanes, at least 4 lanes, at least 8 lanes, at least 16 lanes, at least 32 lanes, at least 64 lanes, etc.), single data lane, parallel data bus, serial data bus, or other computer bus standards that can support data transfer rates of at least 250 MBps (megabytes per second), 500 MBps, 1 GBps (Gigabytes per second), 5 GBps, 10 GBps, 16 GBps, 32 GBps, 64 GBps, or more.
A single host processor 202 running virtual machines 210-1, containers 210-2, host processes 210-3, and other software-level entities (sometimes referred to collectively as “task offloading modules” that offload tasks to programmable accelerator device 100) will need to share host interface 230 among the various types of task offloading modules. To support this, the PCI-SIG has issued an extension to the PCIe specification called single-root input/output virtualization (SR-IOV), which allows the different virtual machines, containers, and software processes in a virtual environment to share a single PCIe hardware interface. The SR-IOV specification enables a single root function (e.g., an Ethernet port) to appear as multiple, separate physical devices, each with its own PCIe configuration space. A physical device with SR-IOV capabilities can be configured to appear in the PCI configuration space as multiple functions on a virtual bus. If desired, the techniques described herein may also be applied to Scalable IOV (SIOV) developed by Intel Corporation or other device virtualization standards.
Intel Scalable IOV is a PCIe-based virtualization technique that enables highly scalable and high performance sharing of I/O devices across isolated domains, while containing the cost and complexity for endpoint device hardware to support such scalable sharing. Depending on the usage model, the isolated domains may be virtual machines, process containers, machine containers, or application processes. Unlike the coarse-grained device partitioning approach of SR-IOV to create multiple VFs on a PF, Scalable IOV enables software to flexibly compose virtual devices utilizing the hardware-assists for device sharing at finer granularity. Performance critical operations on the composed virtual device are mapped directly to the underlying device hardware, while non-critical operations are emulated through device-specific composition software in the host.
SR-IOV defines two types of host-assignable interfaces or PCI functions spanning the hardware/software interface across programmable device 100 and host OS 204: (1) physical functions and (2) virtual functions. A physical function (PF) such as PF 232 is a full-feature PCIe function that is discovered, managed, and configured like typical PCI devices and has its own independent set of Base Address Registers (BARs). Physical function 232 may be managed by a physical function driver 206, which may be part of the kernel space of host OS 204. Physical function 232 configures and manages the SR-IOV functionality by assigning associated virtual functions (VFs) 234. Physical function 232 is therefore sometimes referred to as a “privileged” host-assignable interface. A single physical function 232 can have more than 10, more than 100, up to 256, more than 1000, more than 60000, more than 100000, or at least a million associated virtual functions 234. If desired, each host may also have more than one associated physical function 232.
In contrast to physical function 232, virtual functions 234 are light-weight PCIe functions (i.e., less complex than a PF) that can only control their own behavior. Each virtual function is derived from and shares the resources of an associated physical function. Each virtual function 234 may also have its own independent set of Base Address Registers (BARs) and may be managed by a virtual function driver 214 (which is part of the kernel space of the guest OS running on one of the virtual machines 210-1), a user mode driver (which is part of the user space of the program running on one of the containers), or a user mode driver (which is part of the user space of the program running as part of one of the application processes 210-3). In other words, a virtual function may also be assigned to a container 210-2 for enabling hardware offload. In other scenarios, the virtual function BARs are mapped to the address space of process 210-3 to allow hardware submission. In contrast to the physical function, which is considered a privileged host-assignable interface, virtual function 234 is therefore sometimes referred to as an “unprivileged” host-assignable interface. In general, each virtual machine 210-1, container 210-2, or process 210-3 can map to one or more virtual functions. Configured in this way, virtual functions have near-native performance, provide better performance than para-virtualized drivers and emulated access, and also increase virtualized guest density on host servers within a data center.
Programmable device 100 may include logic to implement physical function 232 and virtual functions 234. Still referring to
Programmable device 100 may also include one or more user-programmable regions that can be dynamically and partially reconfigured based on user requests or workload needs. The process where only a subset of programmable logic on device 100 is loaded is sometimes referred to as “partial reconfiguration.” Each of these partial reconfiguration (PR) regions may serve as a “slot” or container in which an accelerator function unit (AFU) 252 can be loaded. Accelerator function unit 252 is used for offloading a computation operation for an application from host CPU 202 to improve performance. As described above, physical function 232 is primarily responsible for management functions. Thus, the dynamic programming of AFU 252 may be supervised by the physical function driver. The example of
In accordance with an embodiment, an AFU region 252 may be further logically subdivided into different AFU contexts (AFCs) 256, each of which serves as a separate unit of acceleration that can be independently assigned to a corresponding distinct software offloading entity such as VM 210-1, container 210-2, or host process 210-3. In general, each task offloading entity may map to one or more virtual functions.
Each AFC 256 may be a slice of the same accelerator (e.g., as part of the same AFU), a completely different accelerator (e.g., as part of different AFUs), a set of portal registers for enabling work submission to shared accelerator resources, or other logical division. For example, a compression AFU with eight engines could expose the AFCs as spatially shared (i.e., exposing each of the 8 compression engines as a distinct AFC), or temporally shared (i.e., exposing a set of portal/doorbell registers that allows descriptor submission to actual engines that are time-shared or time-multiplexed between different AFCs).
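For illustration only, the following C sketch models how an eight-engine compression AFU might describe its AFCs as either spatially shared (one engine per AFC) or temporally shared (one doorbell portal per AFC); the structure and field names are assumptions and not part of any particular embodiment.

    /* Illustrative sketch only: afc_desc and its fields are assumed names. */
    #include <stdint.h>
    #include <stdio.h>

    enum afc_sharing { AFC_SPATIAL, AFC_TEMPORAL };

    struct afc_desc {
        uint16_t context_id;      /* tag carried on communications interface 254 */
        enum afc_sharing sharing; /* spatial: dedicated engine; temporal: doorbell portal */
        uint8_t engine_index;     /* valid for spatially shared AFCs */
        uint64_t portal_offset;   /* valid for temporally shared AFCs (doorbell MMIO offset) */
    };

    int main(void) {
        struct afc_desc afcs[8];
        /* Spatial sharing: each of the eight compression engines is its own AFC. */
        for (int i = 0; i < 8; i++)
            afcs[i] = (struct afc_desc){ .context_id = (uint16_t)i,
                                         .sharing = AFC_SPATIAL,
                                         .engine_index = (uint8_t)i };
        /* Temporal sharing would instead assign each AFC a doorbell portal and
         * time-multiplex descriptor submission onto the shared engines. */
        for (int i = 0; i < 8; i++)
            printf("AFC %d -> engine %d\n", i, afcs[i].engine_index);
        return 0;
    }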
In general, all AFCs 256 within an AFU 252 are reconfigured together and are connected to PIC 250 via a single communications interface 254. The PF driver in the host enumerates programmable device 100 and manages PIC 250. The host drivers collectively identify the different AFCs and create associations between the hardware associated with each AFC and the corresponding user application software/driver. As an example, one AFC 256 may be mapped to one virtual function 234 or physical function 232. This is merely illustrative. As another example, multiple AFCs 256 (which may be part of the same AFU or different AFUs) can be mapped to one VF 234. If desired, one AFC 256 might also be mapped to two or more virtual functions.
The example described above in which an AFC is mapped to a PCIe physical function or a PCIe SR-IOV virtual function is merely illustrative. In general, an AFC may be mapped to any host-assignable interface such as a Scalable IOV Assignable Device Interface (ADI) or other host-assignable interfaces associated with non-PCIe host interfaces.
To support the logical subdivision of an AFU 252 into multiple AFCs 256, communication interface 254 may be virtualized to support AFU sharing. In particular, all transactions crossing interface 254 may be tagged with a context identifier (ID) tag. This context ID tag is used to ensure proper routing of requests and responses between the software applications on the host and the AFCs as well as providing security isolation.
In one suitable arrangement, the context ID tag may be used to provide address isolation between different AFC memory requests. All “upstream” requests from an AFC 256 (e.g., memory requests from an AFC to the host CPU) are tagged with a context_id, which expands the namespace of each memory request to <context_id, address>, where address might represent the physical address or the virtual address. PIC 250 may map the specific context_id to a platform specific identifier using a context mapping table 260. After SR-IOV is enabled in the physical function, the PCI configuration space of each virtual function can be accessed by the bus, device, and function number of that virtual function (sometimes referred to collectively as the “BDF number”).
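As a purely illustrative sketch, the C fragment below models how PIC 250 might translate an upstream request tagged <context_id, address> into a platform-specific identifier (here a PCIe BDF number) using a context mapping table in the style of table 260; the structure names, table size, and function signature are assumptions rather than part of any embodiment.

    /* Illustrative only: field names and table size are assumed. */
    #include <stdint.h>
    #include <stdio.h>

    struct bdf { uint8_t bus, dev, fn; };

    #define MAX_AFC 16

    struct ctx_map_entry {
        int valid;
        struct bdf id;             /* platform-specific identifier for this context */
    };

    static struct ctx_map_entry ctx_map[MAX_AFC];

    /* Translate an upstream request tagged <context_id, address> into the
     * platform namespace; unmapped contexts are rejected for isolation. */
    int map_upstream(uint16_t context_id, uint64_t address, struct bdf *out)
    {
        if (context_id >= MAX_AFC || !ctx_map[context_id].valid)
            return -1;
        *out = ctx_map[context_id].id;  /* the request is issued with this BDF */
        (void)address;                  /* the address itself passes through unchanged */
        return 0;
    }

    int main(void) {
        ctx_map[2] = (struct ctx_map_entry){ 1, { 0x3b, 0, 1 } };
        struct bdf b;
        if (map_upstream(2, 0x1000, &b) == 0)
            printf("context 2 -> %02x:%02x.%x\n", b.bus, b.dev, b.fn);
        return 0;
    }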
In any case, PIC 250 provides a level of abstraction to each AFU 252 and is responsible for converting the memory request namespace. Operating a virtualized acceleration platform in this way provides the technical advantage of decoupling an AFU's communication semantic from the underlying platform semantics (e.g., from PCIe specific semantics). In general, context_id need only be wide enough to cover the number of supported AFCs, which is likely to be much less than the platform namespace identifier width. Reducing or minimizing the size or bit-width of context_id can save valuable wiring resources on device 100.
A response from the host CPU to an upstream request from an AFC or an MMIO transaction initiated by the host needs to be routed back to that AFC. This routing mechanism may depend on how the AFCs are connected to the PCIe physical and virtual functions. In a first mode, at least some or perhaps all of the AFCs are associated with the physical function and accessed through the physical function. The first mode is therefore sometimes referred to as the “PF mode.” In a second mode, each AFC is associated with a single virtual function, and all host communications for an AFC happens through the associated virtual function. The second mode is therefore sometimes referred to as the “VF mode.” The system may also be operated in a third (mixed) mode, where some AFCs are mapped to the physical function while other AFCs are mapped to corresponding virtual functions.
In the PF mode, the physical function's BDF number is used for multiple AFCs, so the BDF number by itself cannot determine the context_id. To address this issue in the PF mode, PCIe has a mechanism whereby a request can be associated with a tag, and the response is guaranteed to return the same tag. For example, the context_id of the request can be saved in an internal table (see, e.g., table 262 in
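The C sketch below illustrates, under assumed names and a 64-entry tag space, how such an internal table might save the context_id of an outstanding request against its tag and then use the returned tag to route the response back to the issuing AFC.

    /* Illustrative only: table depth and function names are assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_TAGS 64
    static uint16_t tag_to_ctx[NUM_TAGS];
    static int tag_busy[NUM_TAGS];

    /* Record the issuing AFC's context_id when a tagged request goes upstream. */
    int save_tag(uint8_t tag, uint16_t context_id) {
        if (tag >= NUM_TAGS || tag_busy[tag]) return -1;
        tag_to_ctx[tag] = context_id;
        tag_busy[tag] = 1;
        return 0;
    }

    /* Use the tag returned with the response to recover the context_id. */
    int route_response(uint8_t tag, uint16_t *context_id) {
        if (tag >= NUM_TAGS || !tag_busy[tag]) return -1;
        *context_id = tag_to_ctx[tag];
        tag_busy[tag] = 0;
        return 0;
    }

    int main(void) {
        save_tag(5, 3);            /* AFC with context_id 3 issues a request on tag 5 */
        uint16_t ctx;
        if (route_response(5, &ctx) == 0)
            printf("response on tag 5 routed to context %u\n", ctx);
        return 0;
    }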
In the VF mode, given the PCIe BDF number of the virtual function, PIC 250 can uniquely determine the context_id of the AFC. For “downstream” requests (e.g., memory requests from the host CPU back to an AFC or an MMIO transaction initiated by the host CPU), an inverse context mapping table such as inverse context mapping table 260-2 of
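For VF-mode downstream routing, a minimal sketch of an inverse mapping in the style of table 260-2 is shown below; the entry layout and lookup function are assumptions made only for illustration.

    /* Illustrative only: entry layout and names are assumed. */
    #include <stdint.h>
    #include <stdio.h>

    struct inv_entry {
        int valid;
        uint8_t bus, dev, fn;   /* the virtual function's BDF number */
        uint16_t context_id;    /* the AFC reached through that VF */
    };

    #define MAX_ENTRIES 16
    static struct inv_entry inv_map[MAX_ENTRIES];

    /* Route a downstream request by looking up the context_id from the BDF. */
    int route_downstream(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t *context_id)
    {
        for (int i = 0; i < MAX_ENTRIES; i++) {
            if (inv_map[i].valid && inv_map[i].bus == bus &&
                inv_map[i].dev == dev && inv_map[i].fn == fn) {
                *context_id = inv_map[i].context_id;  /* tag the request with this id */
                return 0;
            }
        }
        return -1;  /* no AFC behind this function: reject for isolation */
    }

    int main(void) {
        inv_map[0] = (struct inv_entry){ 1, 0x3b, 0, 1, 7 };
        uint16_t ctx;
        if (route_downstream(0x3b, 0, 1, &ctx) == 0)
            printf("VF 3b:00.1 -> context %u\n", ctx);
        return 0;
    }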
Some embodiments may map the AFC namespace to the physical function for host access. In the PF mode (i.e., when all AFCs are mapped to the physical function), all requests such as memory-mapped input/output (MMIO) requests will use the same platform-specific identifier. In this case, the MMIO address range of the physical function will have to be big enough to map the total MMIO space across all AFCs. An additional address decoder such as address decoder 264 provided as part of PIC 250 can then be used to decode the context_id from a given MMIO address associated with the physical function BAR. If desired, address decoder 264 can also be formed as part of an AFU for decoding AFU addresses and/or part of an AFC for decoding AFC addresses. In the mixed mode where some AFCs are mapped to the physical function and other AFCs are mapped to virtual functions, storing a tag in internal table 262, using an inverse mapping table to identify platform-specific identifiers, using address decoder 264 to decode an MMIO address, and/or some combination of these techniques may be used simultaneously.
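The following sketch illustrates one possible decode performed by an address decoder such as decoder 264, under the simplifying assumption that each AFC owns a fixed 4 KiB window inside the physical function BAR; the window size and function names are illustrative only.

    /* Illustrative only: per-AFC window size and names are assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    #define AFC_MMIO_WINDOW 0x1000u  /* assumed 4 KiB MMIO window per AFC */
    #define NUM_AFCS 8

    /* Decode the context_id and register offset from a PF BAR offset. */
    int decode_mmio(uint64_t pf_bar_offset, uint16_t *context_id, uint32_t *reg_offset)
    {
        uint64_t idx = pf_bar_offset / AFC_MMIO_WINDOW;
        if (idx >= NUM_AFCS)
            return -1;                                  /* outside the mapped AFC space */
        *context_id = (uint16_t)idx;
        *reg_offset = (uint32_t)(pf_bar_offset % AFC_MMIO_WINDOW);
        return 0;
    }

    int main(void) {
        uint16_t ctx; uint32_t off;
        if (decode_mmio(0x3204, &ctx, &off) == 0)
            printf("PF offset 0x3204 -> context %u, register offset 0x%x\n", ctx, off);
        return 0;
    }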
Similar to issuing upstream memory requests, AFCs can optionally issue interrupts. Interrupt requests from an AFC are also tagged with a context_id, which is handled in a similar way as the upstream request flow shown and described in connection with
Certain specifications such as SR-IOV require devices to support function-level reset (FLR) per virtual function. In some scenarios such as during the PF mode or the mixed mode, there could be device management registers which are utilized to trigger context-level reset (CLR) for one or more AFCs. The context-level reset might also be initiated via a function-level reset. Moreover, updates to context mapping table 260 also require a way to quiesce corresponding affected AFCs.
At step 400, a virtual function may initiate a context-level reset (CLR) on a specific AFC. At step 402, a front-end module in PIC 250 may be used to filter transactions targeted to the specific AFC. For example, PIC 250 may be directed to gracefully drop, drain, flush, or otherwise handle all upstream requests from the specified AFC and/or gracefully drop/drain/flush or otherwise handle all downstream requests targeted to the specific AFC. At step 404, the AFC reset may begin by using PIC 250 to send a CLR message to the specific AFC. In response to receiving the CLR message, the specific AFC is expected to stop issuing requests tagged with the specific context_id within X cycles (e.g., within 1-2 clock cycles, 1-5 clock cycles, 1-10 clock cycles, 10-100 clock cycles, etc.). At step 406, the AFC may optionally return a CLR acknowledgement back to PIC 250.
At step 408, PIC 250 may wait until all outstanding transactions from the specific AFC are flushed out (e.g., until all pending upstream requests have received a response from the host CPU and until all pending downstream requests have received a response from the AFC or have timed out). At step 410, the front-end transaction filtering of PIC 250 described in connection with step 402 may be removed or suspended.
At step 412, the CLR operation may be terminated by allowing the specific AFC to resume sending and receiving requests to and from the host CPU. To end a specific CLR, PIC 250 may send a CLR end message to the specific AFC under reset, the specific AFC under reset may send a CLR end message to PIC 250, or PIC 250 may send a new CLR request to a different AFC (assuming PIC 250 can only handle one AFC reset at any given point in time).
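Purely as an illustration of the ordering of steps 400-412, the C sketch below sequences the context-level reset; the helper functions are hypothetical hooks into PIC 250 and are not defined by any embodiment.

    /* Illustrative only: the pic_* helpers are hypothetical placeholders. */
    #include <stdio.h>

    static void pic_filter_afc_traffic(int ctx, int on) { printf("filter ctx %d: %d\n", ctx, on); }
    static void pic_send_clr(int ctx)                   { printf("CLR message to ctx %d\n", ctx); }
    static void pic_wait_clr_ack(int ctx)               { printf("CLR ack from ctx %d\n", ctx); }
    static void pic_drain_outstanding(int ctx)          { printf("drain ctx %d\n", ctx); }
    static void pic_send_clr_end(int ctx)               { printf("CLR end to ctx %d\n", ctx); }

    void context_level_reset(int context_id)
    {
        pic_filter_afc_traffic(context_id, 1); /* step 402: filter transactions for this AFC */
        pic_send_clr(context_id);              /* step 404: AFC stops issuing within X cycles */
        pic_wait_clr_ack(context_id);          /* step 406: optional acknowledgement */
        pic_drain_outstanding(context_id);     /* step 408: flush outstanding transactions */
        pic_filter_afc_traffic(context_id, 0); /* step 410: remove front-end filtering */
        pic_send_clr_end(context_id);          /* step 412: AFC resumes normal operation */
    }

    int main(void) { context_level_reset(3); return 0; }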
The steps of
In accordance with another embodiment, programmable device 100 may be configured to maintain a device feature list (DFL) that allows host CPU 202 or the software running on host CPU 202 to discover, enumerate, and identify which AFCs 256 exist on programmable accelerator device 100. In particular, the physical function on the host processor may manage, reprogram, or update the device feature list. The device feature list maps access to particular AFCs to appropriate user applications running on the host CPU (e.g., to allow different AFCs within a single AFU to be shared by multiple user applications). A DFL is a data structure that includes a linked list of device feature headers (DFHs). Each DFH describes a specific sub-feature of device 100. For example, each DFH provides identity information such that the corresponding feature can be discovered by the host software, as well as describing relevant resources and attributes related to that feature. As will be described below in connection with
The PIC DFH may point to a Port-0 DFH, which exposes port control registers 512. Registers 512 are port-level registers for controlling the interface between the PIC and a corresponding PR region/slot. Registers 512 may, for example, include registers for programming Port-0, registers for controlling the type of ingress and egress data packets through Port-0, registers for signaling errors, registers used for debug or troubleshooting purposes, or other registers meant for management functions specific to Port-0. If desired, the PIC DFH may also point to DFHs associated with other ports (e.g., Port-1 DFH, Port-2 DFH, Port-3 DFH, etc.). The Port-0 DFH may also be used to read an AFU identifier (ID) and the location (e.g., in accordance with the MMIO offset and MMIO length) of corresponding AFU registers in the user partition. The AFU ID may help identify what is loaded in each AFU.
The Port-0 DFH may point to a series of AFC DFHs (e.g., to a first AFC-0 DFH, which points to a second AFC-1 DFH, . . . , which points to a last AFC-M DFH). Each of these AFC DFHs can be used to read the AFC ID and the location (e.g., using the MMIO offset and length information) of corresponding AFC registers 516 in the user partition (see also registers 516 in
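As a hedged illustration of how host software might walk such a chain, the sketch below traverses a simplified in-memory device feature list and reads an identifier and MMIO location from each header; the dfh layout shown is a simplification assumed for this example and does not reflect the exact register format.

    /* Illustrative only: this simplified dfh layout is an assumption. */
    #include <stdint.h>
    #include <stdio.h>

    struct dfh {
        uint64_t feature_id;   /* e.g., an AFU or AFC identifier */
        uint32_t next_offset;  /* offset to the next DFH; 0 terminates the list */
        uint32_t mmio_offset;  /* start of this feature's registers */
        uint32_t mmio_length;  /* size of this feature's register window */
    };

    /* Walk the linked list of device feature headers and report each feature. */
    void enumerate_dfl(const uint8_t *mmio_base, uint32_t first_dfh_offset)
    {
        uint32_t off = first_dfh_offset;
        for (;;) {
            const struct dfh *h = (const struct dfh *)(mmio_base + off);
            printf("feature 0x%llx at mmio 0x%x (length 0x%x)\n",
                   (unsigned long long)h->feature_id, h->mmio_offset, h->mmio_length);
            if (h->next_offset == 0)
                break;
            off += h->next_offset;
        }
    }

    int main(void) {
        struct dfh list[2] = {
            { 0x50, sizeof(struct dfh), 0x0000, 0x100  },  /* port-level DFH */
            { 0xA0, 0,                  0x1000, 0x1000 },  /* AFC-0 DFH, end of list */
        };
        enumerate_dfl((const uint8_t *)list, 0);
        return 0;
    }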
The AFC ID may help identify the functionality that is provided by each AFC. For example, the AFC ID may help identify whether a particular AFC implements a security function (e.g., an encryption/decryption engine), a storage function (e.g., a compression/decompression engine), a machine learning function (e.g., a convolutional neural network engine or a deep neural network engine), a networking function (e.g., a network function virtualization engine), etc.
The AFU/AFC ID, MMIO offset, MMIO length, and other control/status information associated with each of the DFHs may be stored in programmable registers 514. Registers 514 and 516 may have read/write access from the physical function (e.g., the AFU/AFC identifier and location information may be updated by the physical function during partial reconfiguration operations or during a VF assignment/removal process). Programmable registers 514 may be formed as part of PIC 250 (see also registers 514 in
The physical function operates in the host physical address (HPA) space. In contrast, the virtual functions, which can be assigned to a VM or a container, operate in the guest physical address (GPA) space. Thus, the MMIO offset from the PF's perspective may differ from the MMIO offset from the VF's perspective, for instance to avoid gaps in the GPA space. For example, consider a scenario in which each AFC occupies one 4K page, so the MMIO offset for AFC-0 might be 0x0000, whereas the MMIO offset for AFC-1 is 0x1000. However, if AFC-1 is assigned to a virtual function, it might be allocated to MMIO offset 0x0000 to avoid opening up holes in the GPA space from the VF's perspective. While registers 514 have read/write access from the perspective of the physical function (i.e., from the PF's view), registers 514 have read-only access from the perspective of the virtual functions (i.e., from the VF's view).
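The difference between the two views in the example above can be summarized with the short, purely illustrative sketch below, which assumes one 4 KiB page per AFC.

    /* Illustrative only: the 4 KiB-per-AFC layout is an assumption. */
    #include <stdint.h>
    #include <stdio.h>

    #define AFC_PAGE 0x1000u

    /* Offset of AFC n inside the physical function's MMIO space (PF view). */
    static uint64_t pf_view_offset(unsigned afc_index) { return (uint64_t)afc_index * AFC_PAGE; }

    /* Offset of the same AFC once assigned to a virtual function (VF view):
     * the AFC is re-based to the start of the VF BAR so the guest sees no gaps. */
    static uint64_t vf_view_offset(void) { return 0x0; }

    int main(void) {
        printf("AFC-1: PF view offset 0x%llx, VF view offset 0x%llx\n",
               (unsigned long long)pf_view_offset(1),
               (unsigned long long)vf_view_offset());
        return 0;
    }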
In the example of
The various device feature headers (DFHs) described above are generally implemented using programmable DFH registers on device 100. With the addition of context_id tags to transactions traversing communications interface 254, the DFH registers will have to take these tags into account. In one suitable arrangement, the PIC DFH and optionally the Port-0 DFH (referred to collectively as the “AFU DFH”) may be implemented as a programmable register in PIC 250 (see, e.g., DFH register 290 in
In another suitable arrangement, the AFU DFH is contained within the AFU (see, e.g., AFU DFH register 294 in
In yet another suitable arrangement, the AFU DFH and the AFC DFHs may be all implemented as programmable registers in PIC 250. In this configuration, the DFH registers may be updated in parallel along with the partial reconfiguration of each AFU.
At step 904, at least one AFU is reconfigured using the AFU bitstream. Upon updating the AFU, the corresponding DFH registers in PIC 250 may be updated using the received AFU metadata (e.g., to update DFH registers with new AFU and AFC IDs, with new AFU power budget, or with other control settings). At step 908, PIC 250 may set up the context mapping table to include the new AFU/AFC settings (e.g., to optionally update the context_id if necessary). After updating the context mapping table, AFU reprogramming may be complete (at step 910).
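One possible software-side ordering of this reprogramming flow is sketched below; the metadata fields and helper functions are hypothetical placeholders, and only the sequencing mirrors steps 904-910.

    /* Illustrative only: metadata fields and helpers are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    struct afu_metadata {
        uint64_t afu_id;
        uint64_t afc_ids[4];
        unsigned num_afcs;
    };

    static void load_pr_bitstream(const char *path)                 { printf("PR with %s\n", path); }
    static void update_dfh_registers(const struct afu_metadata *m)  { printf("DFH <- AFU 0x%llx\n", (unsigned long long)m->afu_id); }
    static void update_context_map(const struct afu_metadata *m)    { printf("%u AFCs mapped\n", m->num_afcs); }

    void reprogram_afu(const char *bitstream, const struct afu_metadata *meta)
    {
        load_pr_bitstream(bitstream);  /* step 904: partially reconfigure the AFU */
        update_dfh_registers(meta);    /* update the DFH registers from the metadata */
        update_context_map(meta);      /* step 908: refresh the context mapping table */
        /* step 910: AFU reprogramming is complete */
    }

    int main(void) {
        struct afu_metadata m = { 0xABCD, { 1, 2 }, 2 };
        reprogram_afu("afu.bin", &m);  /* "afu.bin" is a hypothetical bitstream file */
        return 0;
    }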
In general, the AFU/AFC identifier and location information, the context mapping tables, and other associated context-level parameters may be updated by the physical function during partial reconfiguration operations or during a VF assignment/removal process.
At step 1002, an AFC that needs to be assigned to a virtual function may be identified. At step 1004, the identified AFC may be quiesced. At step 1006, a context-level reset (see
At step 1008, context mapping table 260-1 (
At step 1010, the AFC's location from the PF's perspective can be stored in driver structures (i.e., in software) and the actual DFH may be updated to reflect the VF's view. At step 1012, VFx may then be assigned to a virtual machine or container. In general, the steps of
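Purely for illustration, the sequencing of steps 1002-1012 can be sketched as follows; the helper functions are hypothetical hooks and are not defined by any embodiment.

    /* Illustrative only: the helper functions are hypothetical placeholders. */
    #include <stdio.h>

    static void quiesce_afc(int ctx)               { printf("quiesce ctx %d\n", ctx); }
    static void context_level_reset(int ctx)       { printf("CLR ctx %d\n", ctx); }
    static void map_context_to_vf(int ctx, int vf) { printf("ctx %d -> VF%d\n", ctx, vf); }
    static void rebase_dfh_to_vf_view(int ctx)     { printf("DFH of ctx %d shows VF view\n", ctx); }
    static void attach_vf_to_vm(int vf, int vm)    { printf("VF%d -> VM%d\n", vf, vm); }

    void assign_afc_to_vf(int context_id, int vf, int vm)
    {
        quiesce_afc(context_id);            /* step 1004 */
        context_level_reset(context_id);    /* step 1006 */
        map_context_to_vf(context_id, vf);  /* step 1008: update context mapping table 260-1 */
        rebase_dfh_to_vf_view(context_id);  /* step 1010: PF view kept in driver structures */
        attach_vf_to_vm(vf, vm);            /* step 1012 */
    }

    int main(void) { assign_afc_to_vf(3, 1, 0); return 0; }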
At step 1052, a corresponding AFC that is associated with VFy may be identified. At step 1054, the identified AFC may be quiesced. At step 1056, VFy may be removed from the virtual machine or container that it currently resides on.
At step 1058, a context-level reset (see
At step 1062, the AFC's location may then be restored from the PF's perspective. In general, the steps of
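The corresponding removal flow of steps 1052-1062 is sketched below in the same illustrative style, with hypothetical helper names; the remapping of the context back to the physical function is an assumption drawn from the assignment flow above.

    /* Illustrative only: the helper functions are hypothetical placeholders. */
    #include <stdio.h>

    static void quiesce_afc(int ctx)            { printf("quiesce ctx %d\n", ctx); }
    static void detach_vf_from_vm(int vf)       { printf("VF%d detached\n", vf); }
    static void context_level_reset(int ctx)    { printf("CLR ctx %d\n", ctx); }
    static void remap_context_to_pf(int ctx)    { printf("ctx %d -> PF\n", ctx); }
    static void restore_dfh_pf_view(int ctx)    { printf("DFH of ctx %d shows PF view\n", ctx); }

    void remove_vf(int vf, int context_id)
    {
        quiesce_afc(context_id);          /* step 1054 */
        detach_vf_from_vm(vf);            /* step 1056 */
        context_level_reset(context_id);  /* step 1058 */
        remap_context_to_pf(context_id);  /* assumed: context mapping restored to the PF */
        restore_dfh_pf_view(context_id);  /* step 1062 */
    }

    int main(void) { remove_vf(1, 3); return 0; }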
The embodiments thus far have been described with respect to integrated circuits. The methods and apparatuses described herein may be incorporated into any suitable circuit. For example, they may be incorporated into numerous types of devices such as programmable logic devices, application specific standard products (ASSPs), application specific integrated circuits (ASICs), microcontrollers, microprocessors, central processing units (CPUs), graphics processing units (GPUs), etc. Examples of programmable logic devices include programmable array logic (PAL) devices, programmable logic arrays (PLAs), field programmable logic arrays (FPLAs), electrically programmable logic devices (EPLDs), electrically erasable programmable logic devices (EEPLDs), logic cell arrays (LCAs), complex programmable logic devices (CPLDs), and field programmable gate arrays (FPGAs), just to name a few.
The programmable logic device described in one or more embodiments herein may be part of a data processing system that includes one or more of the following components: a processor; memory; IO circuitry; and peripheral devices. The data processing system can be used in a wide variety of applications, such as computer networking, data networking, instrumentation, video processing, digital signal processing, or any other suitable application where the advantage of using programmable or re-programmable logic is desirable. The programmable logic device can be used to perform a variety of different logic functions. For example, the programmable logic device can be configured as a processor or controller that works in cooperation with a system processor. The programmable logic device may also be used as an arbiter for arbitrating access to a shared resource in the data processing system. In yet another example, the programmable logic device can be configured as an interface between a processor and one of the other components in the system.
The foregoing is merely illustrative of the principles of this invention and various modifications can be made by those skilled in the art. The foregoing embodiments may be implemented individually or in any combination.