In highly virtualized environments, significant amounts of server resources are expended processing tasks that are beyond user applications. Such processing tasks can include hypervisors, container engines, network and storage functions, security, and large amounts of network traffic. To address these various processing tasks, programmable network interface devices with accelerators and network connectivity have been introduced. These programmable network interface devices are referred to as infrastructure processing units (IPUs), data processing units (DPUs), edge processing units (EPUs), programmable network devices, and so on. The programmable network interface devices can accelerate and manage infrastructure functions using dedicated and programmable cores deployed in the devices. The programmable network interface devices can provide for infrastructure offload and an extra layer of security by serving as a control point of the host for running infrastructure applications. By using a programmable network interface device, the overhead associated with running infrastructure tasks can be offloaded from a server device.
Embodiments described herein are illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements, and in which:
In the following description, numerous specific details are set forth to provide a more thorough understanding. However, it will be apparent to one of skill in the art that the embodiments described herein may be practiced without one or more of these specific details. In other instances, well-known features have not been described to avoid obscuring the details of the present embodiments.
The processing subsystem 101, for example, includes one or more parallel processor(s) 112 coupled to memory hub 105 via a communication link 113, such as a bus or fabric. The communication link 113 may be one of any number of standards-based communication link technologies or protocols, such as, but not limited to PCI Express, or may be a vendor specific communications interface or communications fabric. The one or more parallel processor(s) 112 may form a computationally focused parallel or vector processing system that can include a large number of processing cores and/or processing clusters, such as a many integrated core (MIC) processor. For example, the one or more parallel processor(s) 112 form a graphics processing subsystem that can output pixels to one of the one or more display device(s) 110A coupled via the I/O hub 107. The one or more parallel processor(s) 112 can also include a display controller and display interface (not shown) to enable a direct connection to one or more display device(s) 110B.
Within the I/O subsystem 111, a system storage unit 114 can connect to the I/O hub 107 to provide a storage mechanism for the computing system 100. An I/O switch 116 can be used to provide an interface mechanism to enable connections between the I/O hub 107 and other components, such as a network adapter 118 and/or wireless network adapter 119 that may be integrated into the platform, and various other devices that can be added via one or more add-in device(s) 120. The add-in device(s) 120 may also include, for example, one or more external graphics processor devices, graphics cards, and/or compute accelerators. The network adapter 118 can be an Ethernet adapter or another wired network adapter. The wireless network adapter 119 can include one or more of a Wi-Fi, Bluetooth, near field communication (NFC), or other network device that includes one or more wireless radios.
The computing system 100 can include other components not explicitly shown, including USB or other port connections, optical storage drives, video capture devices, and the like, which may also be connected to the I/O hub 107. Communication paths interconnecting the various components in
The one or more parallel processor(s) 112 may incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry, and may constitute a graphics processing unit (GPU). Alternatively or additionally, the one or more parallel processor(s) 112 can incorporate circuitry optimized for general purpose processing, while preserving the underlying computational architecture, described in greater detail herein. Components of the computing system 100 may be integrated with one or more other system elements on a single integrated circuit. For example, the one or more parallel processor(s) 112, memory hub 105, processor(s) 102, and I/O hub 107 can be integrated into a system on chip (SoC) integrated circuit. Alternatively, the components of the computing system 100 can be integrated into a single package to form a system in package (SIP) configuration. In one embodiment, at least a portion of the components of the computing system 100 can be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computing system. An MCM or SIP configuration can include multiple integrated circuits, chiplets, dielets, tiles, or other circuit forms. The term chip may refer to a packaged die, while die may refer to a bare singulated instantiation of a chip design that is not packaged. However, the terms chip and die are often used interchangeably in the art. When described herein, the term chiplet is intended to convey an at least partially packaged integrated circuit that may be integrated with other circuits in an MCM or SIP configuration.
In some configurations, the computing system 100 includes one or more accelerator device(s) 130 coupled with the memory hub 105, in addition to the processor(s) 102 and the one or more parallel processor(s) 112. The accelerator device(s) 130 are configured to perform domain specific acceleration of workloads to handle tasks that are computationally intensive or demand high throughput. The accelerator device(s) 130 can reduce the burden placed on the processor(s) 102 and/or parallel processor(s) 112 of the computing system 100. The accelerator device(s) 130 can include but are not limited to smart network interface cards, data processing units, cryptographic accelerators, storage accelerators, artificial intelligence (AI) accelerators, compression units, neural processing units (NPUs), and/or video transcoding accelerators.
It will be appreciated that the computing system 100 shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 102, and the number of parallel processor(s) 112, may be modified as desired. For instance, system memory 104 can be connected to the processor(s) 102 directly rather than through a bridge, while other devices communicate with system memory 104 via the memory hub 105 and the processor(s) 102. In other alternative topologies, the parallel processor(s) 112 are connected to the I/O hub 107 or directly to one of the one or more processor(s) 102, rather than to the memory hub 105. In other embodiments, the I/O hub 107 and memory hub 105 may be integrated into a single chip. It is also possible that two or more sets of processor(s) 102 are attached via multiple sockets, which can couple with two or more instances of the parallel processor(s) 112.
Some of the particular components shown herein are optional and may not be included in all implementations of the computing system 100. For example, any number of add-in cards or peripherals may be supported, or some components may be eliminated. Furthermore, some architectures may use different terminology for components similar to those illustrated in
The system 200 may include workload clusters 218A-218B. The workload clusters 218A-218B can include a rack 248 that houses multiple servers (e.g., server 246). The rack 248 and the servers of the workload clusters 218A-218B may conform to the rack unit (“U”) standard, in which one rack unit conforms to a 19 inch wide rack frame and a full-sized industry standard rack accommodates 42 units (42U) of equipment. One unit (1U) of equipment (e.g., a 1U server) may be 1.75 inches high and approximately 36 inches deep. In various configurations, compute resources such as processors, memory, storage, accelerators, and switches may fit into some multiple of rack units within a rack 248.
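The rack-unit arithmetic above can be illustrated with a short sketch. The 1.75 inch unit height and the 42U rack size come from the description; the function names are hypothetical and are introduced only for illustration.

```python
import math

RACK_UNIT_HEIGHT_IN = 1.75  # height of one rack unit (1U), from the description
FULL_RACK_UNITS = 42        # units in a full-sized rack (42U), from the description


def usable_height_in(units: int = FULL_RACK_UNITS) -> float:
    """Total equipment height, in inches, provided by a number of rack units."""
    return units * RACK_UNIT_HEIGHT_IN


def rack_units_needed(device_height_in: float) -> int:
    """Smallest whole number of rack units that accommodates a device height."""
    return math.ceil(device_height_in / RACK_UNIT_HEIGHT_IN)


def servers_per_rack(units_per_server: int, rack_units: int = FULL_RACK_UNITS) -> int:
    """How many servers of a given rack-unit size fit in one rack."""
    return rack_units // units_per_server
```

For example, a 42U rack provides 42 × 1.75 = 73.5 inches of equipment height, and a rack populated only with 2U servers would hold 21 of them.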
The server 246 may host a standalone operating system configured to provide server functions, or the servers may be virtualized. A virtualized server may be under the control of a virtual machine manager (VMM), hypervisor, and/or orchestrator, and may host one or more virtual machines, virtual servers, or virtual appliances. The workload clusters 218A-218B may be collocated in a single datacenter, or may be located in different geographic datacenters. Depending on the contractual agreements, some servers may be specifically dedicated to certain enterprise clients or tenants while other servers may be shared.
The various devices in a datacenter may be connected to one another via a switching fabric 270, which may include one or more high speed routing and/or switching devices. The switching fabric 270 may provide north-south traffic 202 (e.g., traffic to and from the wide area network (WAN), such as the internet), and east-west traffic 204 (e.g., traffic across the datacenter). Historically, north-south traffic 202 accounted for the bulk of network traffic, but as web services become more complex and distributed, the volume of east-west traffic 204 has risen. In many datacenters, east-west traffic 204 now accounts for the majority of traffic. Furthermore, as the capability of the server 246 increases, traffic volume may further increase. For example, the server 246 may provide multiple processor slots, with respective slots accommodating a processor having four to eight cores, along with sufficient memory for the cores. Thus, the server may host a number of VMs that may be a source of traffic generation.
To accommodate the large volume of traffic in a datacenter, a highly capable implementation of the switching fabric 270 may be provided. The illustrated implementation of the switching fabric 270 is an example of a flat network in which the server 246 may have a direct connection to a top-of-rack switch (ToR switch 220A-220B) (e.g., a “star” configuration). ToR switch 220A can connect with workload cluster 218A, while ToR switch 220B can connect with workload cluster 218B. The ToR switch 220A-220B may couple to a core switch 260. This two-tier flat network architecture is shown as an illustrative example and other architectures may be used, such as three-tier star or leaf-spine (also called “fat tree” topologies) based on the “Clos” architecture, hub-and-spoke topologies, mesh topologies, ring topologies, or 3-D mesh topologies, by way of nonlimiting example.
The switching fabric 270 may be provided by any suitable interconnect using any suitable interconnect protocol. For example, the server 246 may include a fabric interface (FI) of some type, a network interface card (NIC), or other host interface. The host interface itself may couple to one or more processors via an interconnect or bus, such as PCI, PCIe, or similar, and in some cases, this interconnect bus may be considered to be part of the switching fabric 270. The switching fabric may also use PCIe physical interconnects to implement more advanced protocols, such as compute express link (CXL).
The interconnect technology may be provided by a single interconnect or a hybrid interconnect, such as where PCIe provides on-chip communication, 1 Gb or 10 Gb copper Ethernet provides relatively short connections to a ToR switch 220A-220B, and optical cabling provides relatively longer connections to core switch 260. Interconnect technologies include, by way of nonlimiting example, Ultra Path Interconnect (UPI), FibreChannel, Ethernet, FibreChannel over Ethernet (FCoE), InfiniBand, PCIe, NVLink, or fiber optics, to name just a few. Some of these will be more suitable for certain deployments or functions than others, and selecting an appropriate fabric for the instant application is an exercise of ordinary skill.
In one embodiment, the switching elements of the switching fabric 270 are configured to implement switching techniques to improve the performance of the network in high usage scenarios. Example advanced switching techniques include but are not limited to adaptive routing, adaptive fault recovery, and adaptive and/or telemetry-based congestion control.
Adaptive routing enables a ToR switch 220A-220B and/or core switch 260 to select the output port to which traffic is switched based on the load on the selected port, assuming unconstrained port selection is enabled. An adaptive routing table can configure the forwarding tables of switches of the switching fabric 270 to select between multiple ports between switches when multiple connections are present between a given set of switches in an adaptive routing group. Adaptive fault recovery (e.g., self-healing) enables the automatic selection of an alternate port if the port selected by the forwarding table is in a failed or inactive state, which enables rapid recovery in the event of a switch-to-switch port failure. A notification can be sent to neighboring switches when adaptive routing or adaptive fault recovery becomes active in a given switch. Adaptive congestion control configures a switch to send a notification to neighboring switches when port congestion on that switch exceeds a configured threshold, which may cause those neighboring switches to adaptively switch to uncongested ports on that switch or switches associated with an alternate route to the destination.
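The port selection behavior described above can be sketched as follows. This is a minimal illustration rather than an actual switch implementation: the `Port` structure, the normalized `load` metric, the default threshold of 0.8, and the notification labels are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    load: float       # fraction of port capacity in use, 0.0 to 1.0 (assumed metric)
    up: bool = True   # False when the port is failed or inactive


def select_output_port(group: list[Port], preferred: str, threshold: float = 0.8):
    """Pick an output port from an adaptive routing group.

    Returns the chosen port and the list of notifications (hypothetical labels)
    that the switch would send to neighboring switches.
    """
    live = [p for p in group if p.up]
    if not live:
        raise RuntimeError("no usable port in adaptive routing group")

    # Adaptive routing: choose the least-loaded port that is up.
    chosen = min(live, key=lambda p: p.load)

    notifications = []
    pref = next((p for p in group if p.name == preferred), None)
    if pref is None or not pref.up:
        # Adaptive fault recovery: the forwarding-table port cannot be used.
        notifications.append("fault-recovery")
    elif chosen.name != preferred:
        notifications.append("adaptive-routing")

    # Adaptive congestion control: report congestion above the threshold.
    if pref is not None and pref.up and pref.load > threshold:
        notifications.append("congestion")
    return chosen, notifications
```

With a preferred port at 90% load and an alternate at 20% load, this sketch selects the alternate and reports both an adaptive routing event and a congestion event. If the preferred port is down, only the fault recovery notification is produced.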
Telemetry-based congestion control uses real-time monitoring of telemetry from network devices, such as switches within the switching fabric 270, to detect when congestion will begin to impact the performance of the switching fabric 270 and proactively adjust the switching tables within the network devices to prevent or mitigate the impending congestion. A ToR switch 220A-220B and/or core switch 260 can implement a built-in telemetry-based congestion control algorithm or can provide an API through which a programmable telemetry-based congestion control algorithm can be implemented. A continuous feedback loop may be implemented in which the telemetry-based congestion control system continuously monitors the network and adjusts the traffic flow in real-time based on ongoing telemetry data. Learning and adaptation can be implemented by the telemetry-based congestion control system in which the system can adapt to changing network conditions and improve its congestion control strategies based on historical data and trends.
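One way to picture the continuous feedback loop is the simplified sketch below. It is not the algorithm of any particular switch: the class name, the use of an exponentially weighted moving average (EWMA) as the "historical trend", the smoothing factor, and the utilization threshold are all assumptions chosen for illustration.

```python
class TelemetryCongestionController:
    """Adjusts per-port traffic weights from a stream of utilization samples."""

    def __init__(self, ports, alpha: float = 0.5, threshold: float = 0.8):
        self.alpha = alpha            # weight given to the newest sample (assumed)
        self.threshold = threshold    # utilization treated as congested (assumed)
        self.ewma = {p: 0.0 for p in ports}
        self.weights = {p: 1.0 / len(ports) for p in ports}

    def observe(self, samples: dict) -> None:
        """Fold one round of telemetry (port -> utilization in [0, 1]) into the trend."""
        for port, util in samples.items():
            self.ewma[port] = self.alpha * util + (1 - self.alpha) * self.ewma[port]
        self._rebalance()

    def _rebalance(self) -> None:
        # Headroom is how far each port's smoothed utilization is below the threshold.
        headroom = {p: max(0.0, self.threshold - u) for p, u in self.ewma.items()}
        total = sum(headroom.values())
        if total == 0:
            return  # every port is at or above the threshold; keep current weights
        # Shift traffic toward ports with more headroom before congestion sets in.
        self.weights = {p: h / total for p, h in headroom.items()}
```

Each call to `observe` corresponds to one iteration of the monitoring loop. A port whose smoothed utilization keeps rising receives a shrinking share of traffic and eventually a weight of zero once it reaches the threshold.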
Note, however, that while high-end fabrics are provided herein by way of illustration, more generally, the switching fabric 270 may include any suitable interconnect or bus for the particular application, including legacy interconnects used to implement local area networks (LANs), synchronous optical networks (SONET), asynchronous transfer mode (ATM) networks, wireless networks such as Wi-Fi and Bluetooth, 5G wireless, DSL interconnects, MoCA, or similar. It is also expressly anticipated that in the future, new network technologies will arise to supplement or replace some of those listed here, and any such future network topologies and technologies can be or form a part of the switching fabric 270.
The datacenter 300 includes a number of logic elements forming a plurality of nodes, where the respective nodes may be provided by a physical server, a group of servers, or other hardware. The server may also host one or more virtual machines, as appropriate to its application. A fabric 370 is provided to interconnect various aspects of datacenter 300. The fabric 370 may be provided by any suitable interconnect technology, including but not limited to InfiniBand, Ethernet, PCIe, or CXL. The fabric 370 of the datacenter 300 may be a version of and/or include elements of the switching fabric 270 of the system 200 of
The server nodes of the datacenter 300 can include but are not limited to a memory server node 304, a heterogenous compute server node 306, a CPU server node 308, and a storage server node 310. The heterogenous compute server node 306 and the CPU server node 308 can perform independent operations for different tenants or cooperatively perform operations for a single tenant. The heterogenous compute server node 306 and the CPU server node 308 can also host virtual machines that provide virtual server functionality to tenants of the datacenter.
The server nodes can connect with the fabric 370 via a fabric interface 372. The specific type of fabric interface 372 that is used depends at least in part on the technology or protocol that is used to implement the fabric 370. For example, where the fabric 370 is an Ethernet fabric, the respective fabric interface 372 may be an Ethernet network interface controller. Where the fabric 370 is a PCIe-based fabric, the fabric interfaces may be PCIe-based interconnects. Where the fabric 370 is an InfiniBand fabric, the fabric interface 372 of the heterogenous compute server node 306 and the CPU server node 308 may be a host channel adapter (HCA), while the fabric interface 372 of the memory server node 304 and storage server node 310 may be a target channel adapter (TCA). TCA functionality may be an implementation-specific subset of HCA functionality. The various fabric interfaces may be implemented as intellectual property (IP) blocks that can be inserted into an integrated circuit as a modular unit, as can other circuitry within the datacenter 300.
The heterogenous compute server node 306 includes multiple CPU sockets that can house a CPU 319, where the respective CPU 319 may be, but is not limited to, an Intel® Xeon™ processor including a plurality of cores. The CPU 319 may also be, for example, a multi-core datacenter class ARM® CPU, such as an NVIDIA® Grace™ CPU. The heterogenous compute server node 306 includes memory devices 318 to store data for runtime execution and storage devices 316 to enable the persistent storage of data within non-volatile memory devices. The heterogenous compute server node 306 is enabled to perform heterogenous processing via the presence of GPUs (e.g., GPU 317), which can be used, for example, to perform high-performance compute (HPC), media server, cloud gaming server, and/or machine learning compute operations. In one configuration, the GPUs may be interconnected with one another and with CPUs of the heterogenous compute server node 306 via interconnect technologies such as PCIe, CXL, or NVLink.
The CPU server node 308 includes a plurality of CPUs (e.g., CPU 319), memory (e.g., memory devices 318) and storage (storage devices 316) to execute applications and other program code that provide server functionality, such as web servers or other types of functionality that is remotely accessible by clients of the CPU server node 308. The CPU server node 308 can also execute program code that provides services or micro-services that enable complex enterprise functionality. The fabric 370 will be provisioned with sufficient throughput to enable the CPU server node 308 to be simultaneously accessed by a large number of clients, while also retaining sufficient throughput for use by the heterogenous compute server node 306 and to enable the use of the memory server node 304 and the storage server node 310 by the heterogenous compute server node 306 and the CPU server node 308. Furthermore, in one configuration, the CPU server node 308 may rely primarily on distributed services provided by the memory server node 304 and the storage server node 310, as the memory and storage of the CPU server node 308 may not be sufficient for all of the operations intended to be performed by the CPU server node 308. Instead, a large pool of high-speed or specialized memory may be dynamically provisioned between a number of nodes, so that the respective node has access to a large pool of resources, but those resources do not sit idle when that particular node does not utilize them. A distributed architecture of this type is possible due to the high speeds and low latencies provided by the fabric 370 of contemporary datacenters and may be advantageous because the resources do not have to be over-provisioned for the server node.
The memory server node 304 can include memory nodes 305 having memory technologies that are suitable for the storage of data used during the execution of program code by the heterogenous compute server node 306 and the CPU server node 308. The memory nodes 305 can include volatile memory modules, such as DRAM modules, and/or non-volatile memory technologies that can operate at speeds similar to DRAM (e.g., 3D XPoint memory), such that those modules have sufficient throughput and latency performance metrics to be used as a tier of system memory at execution runtime. The memory server node 304 can be linked with the heterogenous compute server node 306 and/or CPU server node 308 via technologies such as CXL.mem, which enables memory access from a host to a device. In such a configuration, a CPU 319 of the heterogenous compute server node 306 or the CPU server node 308 can link to the memory server node 304 and access the memory nodes 305 of the memory server node 304 in a similar manner as, for example, the CPU 319 of the heterogenous compute server node 306 can access device memory of a GPU within the heterogenous compute server node 306. For example, the memory server node 304 may provide remote direct memory access (RDMA) to the memory nodes 305, in which, for example, the CPU server node 308 may access memory resources on the memory server node 304 via the fabric 370 using DMA operations, in a similar manner as how the CPU would access its own onboard memory.
The memory server node 304 can be used by the heterogenous compute server node 306 and CPU server node 308 to expand the runtime memory that is available during memory-intensive activities such as the training of machine learning models. A tiered memory system can be enabled in which model data can be swapped into and out of the memory devices 318 of the heterogenous compute server node 306 to memory of the memory server node 304 at higher performance and/or lower latency than local storage (e.g., storage devices 316). During workload execution setup, the working set of data may be loaded into one or more of the memory nodes 305 of the memory server node 304 and loaded into the memory devices 318 of the heterogenous compute server node 306 during execution of a heterogenous workload.
The storage server node 310 provides storage functionality to the heterogenous compute server node 306, the CPU server node 308, and potentially the memory server node 304. The storage server node 310 may provide a networked bunch of disks (NBOD), program flash memory (PFM), redundant array of independent disks (RAID), redundant array of independent nodes (RAIN), network attached storage (NAS), or other nonvolatile memory solutions. In one configuration, the storage server node 310 can couple with the heterogenous compute server node 306, the CPU server node 308, and/or the memory server node 304 via technologies such as NVMe-oF, which enables the NVMe protocol to be implemented over the fabric 370. In such configurations, the fabric interface 372 of those servers may be smart interfaces that include hardware to accelerate NVMe-oF operations.
The accelerators 330 within the datacenter 300 can provide various accelerated functions, including hardware or coprocessor acceleration for functions such as packet processing, encryption, decryption, compression, decompression, network security, or other accelerated functions in the datacenter. In some examples, accelerators 330 may include deep learning accelerators, such as neural processing units (NPUs), that can receive offload of matrix multiply operations or other neural network operations from the heterogenous compute server node 306 or the CPU server node 308. In some configurations, the accelerators 330 may reside in a dedicated accelerator server or be distributed throughout the various server nodes of the datacenter 300. For example, an NPU may be directly attached to one or more CPU cores within the heterogenous compute server node 306 or the CPU server node 308. In some configurations, the accelerators 330 can include or be included within smart network controllers, infrastructure processing units (IPUs), data processing units (DPUs), or edge processing units (EPUs), which combine network controller functionality with accelerator, processor, or coprocessor functionality.
In one configuration, the datacenter 300 can include gateways 340A-340B from the fabric 370 to other fabrics, fabric architectures, or interconnect technologies. For example, where the fabric 370 is an InfiniBand fabric, the gateways 340A-340B may be gateways to an Ethernet fabric. Where the fabric 370 is an Ethernet fabric, the gateways 340A-340B may include routers to route data to other portions of the datacenter 300 or to a larger network, such as the Internet. For example, a first gateway 340A may connect to a different network or subnet within the datacenter 300, while a second gateway 340B may be a router to the Internet.
The orchestrator 360 manages the provisioning, configuration, and operation of network resources within the datacenter 300. The orchestrator 360 may include hardware or software that executes on a dedicated orchestration server. The orchestrator 360 may also be embodied within software that executes, for example, on the CPU server node 308 that configures software defined networking (SDN) functionality of components within the datacenter 300. In various configurations, the orchestrator 360 can enable automated provisioning and configuration of components of the datacenter 300 by performing network resource allocation and template-based deployment. Template-based deployment is a method for provisioning and managing IT resources using predefined templates, where the templates may be based on standard templates used by government, service provider, or financial organizations, or on templates defined by a standard or a customer. The template may also dictate service level agreements (SLAs) or service level obligations (SLOs). The orchestrator 360 can also perform functionality including but not limited to load balancing and traffic engineering, network segmentation, security automation, real-time telemetry monitoring, and adaptive switching management, including telemetry-based adaptive switching. In some configurations, the orchestrator 360 can also provide multi-tenancy and virtualization support by enabling virtual network management, including the creation and deletion of virtual LANs (VLANs) and virtual private networks (VPNs), and tenant isolation for multi-tenant datacenters.
In various network configurations, the forwarding element 400 is deployed as a non-edge forwarding element in the interior of the network to forward data messages from a source device to a destination device. In other network configurations, the forwarding element 400 is deployed as an edge forwarding element at the edge of the network to connect to compute devices (e.g., standalone or host computers) that serve as sources and destinations of the data messages. As a non-edge forwarding element, the forwarding element 400 forwards data messages between forwarding elements in the network, such as through an intervening network fabric. As an edge forwarding element, the forwarding element 400 forwards data messages to and from edge compute devices to one another, to other edge forwarding elements, and/or to non-edge forwarding elements.
The forwarding element 400 includes circuitry to implement a data plane 402 that performs the forwarding operations of the forwarding element 400 to forward data messages received by the forwarding element to other devices. The forwarding element 400 also includes circuitry to implement a control plane 404 that configures the data plane circuit. Additionally, the forwarding element 400 includes physical ports 406 that receive data messages from, and transmit data messages to, devices outside of the forwarding element 400. The data plane 402 includes ports 408 that receive data messages from the physical ports 406 for processing. The data messages are processed and forwarded to another port on the data plane 402, which is connected to another physical port of the forwarding element 400. In addition to being associated with physical ports of the forwarding element 400, some of the ports 408 on the data plane 402 may be associated with other modules of the data plane 402.
The data plane includes programmable packet processor circuits that provide several programmable message-processing stages that can be configured to perform the data-plane forwarding operations of the forwarding element 400 to process and forward data messages to their destinations. These message-processing stages perform these forwarding operations by processing data tuples (e.g., message headers) associated with data messages received by the data plane 402 in order to determine how to forward the messages. The message-processing stages include match-action units (MAUs) that try to match data tuples (e.g., header vectors) of messages with table records that specify action to perform on the data tuples. In some embodiments, table records are populated by the control plane 404 and are not known when configuring the data plane to execute a program provided by a network user. The programmable message-processing circuits are grouped into multiple message-processing pipelines. The message-processing pipelines can be ingress or egress pipelines before or after the forwarding element's traffic management stage that directs messages from the ingress pipelines to egress pipelines.
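The match-action behavior described above can be sketched in a few lines. This is a software illustration only, not a model of any specific hardware pipeline: header vectors are represented as plain dictionaries, matching is exact-match on a single field, and the names `MatchActionUnit`, `forward`, `drop`, and `run_pipeline` are hypothetical.

```python
def forward(port: int):
    """Build an action that sets the egress port in the header vector."""
    def action(hv: dict) -> dict:
        hv["egress_port"] = port
        return hv
    return action


def drop(hv: dict) -> dict:
    """Action that marks the message as dropped (no egress port)."""
    hv["egress_port"] = None
    return hv


class MatchActionUnit:
    """One message-processing stage: match a header field, apply the action."""

    def __init__(self, key_field: str, default_action):
        self.key_field = key_field
        self.default_action = default_action
        self.table = {}  # table records, populated later (e.g., by a control plane)

    def add_entry(self, key, action) -> None:
        self.table[key] = action

    def process(self, hv: dict) -> dict:
        # Look up the record for this header value; fall back to the default.
        action = self.table.get(hv.get(self.key_field), self.default_action)
        return action(hv)


def run_pipeline(stages, hv: dict) -> dict:
    """Pass a header vector through each stage of a pipeline in order."""
    for stage in stages:
        hv = stage.process(hv)
    return hv
```

In this sketch the stage is created with an empty table, which mirrors the point that table records are not known when the data plane program is configured and are supplied afterward.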
The specifics of the hardware of the data plane 402 depend on the communication protocol implemented via the forwarding element 400. Ethernet switches use application specific integrated circuits (ASICs) designed to handle Ethernet frames and the TCP/IP protocol stack. These ASICs are optimized for a broad range of traffic types, including unicast, multicast, and broadcast. Ethernet switch ASICs are generally designed to balance cost, power consumption, and performance, although high-end Ethernet switches may support more advanced features such as deep packet inspection and advanced QoS (Quality of Service). InfiniBand switches use specialized ASICs designed for ultra-low latency and high throughput. These ASICs are optimized for handling the InfiniBand protocol and provide support for RDMA and other features that utilize precise timing and high-speed data processing. High-end Ethernet switches may support RoCE (RDMA over Converged Ethernet), which offers similar benefits to InfiniBand but with higher latency compared to native InfiniBand RDMA.
The forwarding element 400 may also be configured as an NVLink switch (e.g., NVSwitch), which is used to interconnect multiple graphics processors via the NVLink connection protocol. When configured as an NVLink switch, the forwarding element 400 can provide GPU servers with increased GPU to GPU bandwidth relative to GPU servers interconnected via InfiniBand. An NVLink switch can reduce network traffic hotspots that may occur when interconnected GPU-equipped servers execute operations such as distributed neural network training.
In general, where the data plane 402, in concert with a program executed on the data plane 402 (e.g., a program written in the P4 language), performs message or packet forwarding operations for incoming data, the control plane 404 determines how messages or packets should be forwarded. The behavior of a program executed on the data plane 402 is determined in part by the control plane 404, which populates match-action tables with specific forwarding rules. The forwarding rules that are used by the program executed on the data plane 402 are independent of the data plane program itself. In one configuration, the control plane 404 can couple with a management port 410 that enables administrator configuration of the forwarding element 400. The data connection that is established via the management port 410 is separate from the data connections for ingress and egress data ports. In one configuration, the management port 410 may connect with a management plane 405, which facilitates administrative access to the device, enables the analysis of device state and health, and enables device reconfiguration. The management plane 405 may be a portion of the control plane 404 or in direct communication with the control plane 404. In one implementation, there is no direct access for the administrator to components of the control plane 404. Instead, information is gathered by the management plane 405 and the changes to the control plane 404 are carried out by the management plane 405.
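The separation of planes described above can be illustrated with the following sketch, in which the data plane program is fixed while its forwarding rules are installed at run time, and an administrator reaches the control plane only through the management plane. The class and method names are hypothetical and the rule format (destination to egress port) is an assumption made to keep the example small.

```python
class DataPlane:
    """Fixed forwarding program: look up a destination, return an egress port."""

    def __init__(self):
        self._rules = {}  # match-action table contents, empty until populated

    def forward(self, dst: str):
        return self._rules.get(dst)  # None means no matching rule


class ControlPlane:
    """Decides how messages are forwarded by populating data plane tables."""

    def __init__(self, data_plane: DataPlane):
        self._dp = data_plane

    def install_rule(self, dst: str, port: int) -> None:
        self._dp._rules[dst] = port

    def remove_rule(self, dst: str) -> None:
        self._dp._rules.pop(dst, None)


class ManagementPlane:
    """Administrative entry point; carries out changes on the control plane."""

    def __init__(self, control_plane: ControlPlane):
        self._cp = control_plane
        self.audit_log = []  # record of requested changes (illustrative)

    def apply_change(self, admin: str, dst: str, port: int) -> None:
        self.audit_log.append((admin, dst, port))
        self._cp.install_rule(dst, port)
```

The same `DataPlane` object forwards differently before and after a rule is installed, which reflects the statement that the forwarding rules are independent of the data plane program itself.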
The switches 432A-432E include a data plane 402, a control plane 404, a management plane 405, and physical ports 406, as in the forwarding element 400 of
An adaptive routing (AR) event may be detected by one of the switches along a route that becomes compromised, for example, when the switch attempts to output packets on a designated output port. For example, data sent from the source node 422 to the destination node 442 can traverse links through switches of the network. An AR event may be detected by switch 432D for link 429B, for example, in response to congestion or a link fault associated with link 429B. Upon detecting the AR event, switch 432D, as the detecting switch, generates an adaptive routing notification (ARN), which has an identifier that distinguishes an ARN packet from other packet types. In various embodiments, the ARN includes parameters such as an identifier for the detecting switch, the type of AR event, the source and destination address of the flow that triggered the AR event, and/or any other suitable parameters. The detecting switch sends the ARN backwards along the route to the preceding switches. The ARN may include a request for notified switches to modify the route to avoid traversal of the detecting switch. A notified switch can then evaluate whether its routes may be modified to bypass the detecting switch. If so, the notified switch modifies the route. Otherwise, the switch forwards the ARN to the next preceding switch along the route. In this scenario, switch 432B is not able to avoid switch 432D and will relay the ARN to switch 432A. Switch 432A can determine to adapt the route to the destination node 442 by using link 427A to switch 432C. Switch 432C can reach switch 432E via link 429A, allowing packets from the source node 422 to reach the destination node 442 while bypassing the AR event related to link 429B.
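The backward propagation of an ARN can be sketched as follows. This is a simplified model under stated assumptions: the adjacency map is inferred from the scenario described above (the text does not name every link), switches are identified by their reference numerals, and the functions find_path and handle_arn are hypothetical helpers rather than part of any switch firmware.

```python
# Assumed forward adjacency between switches, inferred from the scenario above.
NEIGHBORS = {
    "432A": ["432B", "432C"],
    "432B": ["432D"],
    "432C": ["432E"],
    "432D": ["432E"],
    "432E": [],
}


def find_path(start, goal, avoid, seen=None):
    """Depth-first search for a path from start to goal that skips `avoid`."""
    seen = (seen or set()) | {start}
    if start == goal:
        return [start]
    for nxt in NEIGHBORS[start]:
        if nxt == avoid or nxt in seen:
            continue
        rest = find_path(nxt, goal, avoid, seen)
        if rest:
            return [start] + rest
    return None


def handle_arn(route, detecting, goal):
    """Walk the ARN backwards from the detecting switch along the route.

    Returns (rerouting_switch, new_route), or (None, None) when no preceding
    switch can bypass the detecting switch.
    """
    idx = route.index(detecting)
    for i in range(idx - 1, -1, -1):  # preceding switches, nearest first
        notified = route[i]
        alternative = find_path(notified, goal, avoid=detecting)
        if alternative:
            return notified, route[:i] + alternative
        # No bypass available here: the ARN is relayed one hop further back.
    return None, None


original_route = ["432A", "432B", "432D", "432E"]
rerouter, new_route = handle_arn(original_route, detecting="432D", goal="432E")
```

With the assumed adjacency, switch 432B has no alternative and relays the notification, and switch 432A reroutes through switch 432C, matching the scenario in the preceding paragraph.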
In various configurations, the network 420 can also adapt to congestion scenarios via programmable data planes within the switches 432A-432E that are able to execute data plane programs to implement in-network congestion control algorithms (CCAs) for TCP over Ethernet-based fabrics. Using in-band network telemetry (INT), programmable data planes within the switches 432A-432E can become aware when a port or link along a route is becoming congested and preemptively seek to route packets over alternate paths. For example, switch 432A can load balance traffic to the destination node 442 between link 427A and link 427B based on the level of congestion seen on the routes downstream from those links.
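One way to picture congestion-aware load balancing between link 427A and link 427B is the sketch below. The policy shown, in which each link receives traffic in proportion to its spare capacity, is an assumption made for illustration; the description does not specify a particular weighting. The function name split_weights and the normalized congestion scale of 0 to 1 are likewise hypothetical.

```python
def split_weights(congestion):
    """Map per-link congestion levels in [0, 1] to traffic shares.

    Assumption: a link's share is proportional to its spare capacity,
    computed as (1 - congestion level).
    """
    spare = {link: max(0.0, 1.0 - level) for link, level in congestion.items()}
    total = sum(spare.values())
    if total == 0:
        # Every link is saturated: fall back to an even split.
        return {link: 1.0 / len(spare) for link in spare}
    return {link: value / total for link, value in spare.items()}


# Congestion levels as they might be reported through telemetry (made-up values).
shares = split_weights({"427A": 0.2, "427B": 0.6})
```

Here the less congested link 427A receives roughly two thirds of the traffic. A real data plane program would more likely select a path per flow or per flowlet than split individual packets by fraction.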
The local memory 456 includes multiple queues, including an outer receive queue 462, an outer transmit queue 463, an inner receive queue 464, and an inner transmit queue 465. The outer queues are used for data that is received at a given multi-port IB interface and that is to be forwarded back out the same multi-port IB interface. The inner queues are used for data that is forwarded out a different multi-port IB interface than the one used to receive the data. Other types of queue configurations may be implemented in local memory 456. For example, different queues may be present to support multiple traffic classes, either on an individual port basis, shared port basis, or a combination thereof. The multi-port IB interfaces 460A-460D include power management circuitry 455, which can adjust a power state of circuitry within the respective multi-port IB interface. Additionally, power management logic that performs similar operations may be implemented as part of core switch logic.
The multi-port IB interfaces 460A-460D include packet processing and switching logic 458, which is generally used to perform aspects of packet processing and/or switching operations that are performed at the local multi-port level rather than across the IB switch as a whole. Depending on the implementation, the packet processing and switching logic 458 can be configured to perform a subset of the operations of the packet processing and switching logic 478 within the core switch logic 480, or can be configured with the full functionality of the packet processing and switching logic 478 within the core switch logic 480. The processing functionality of the packet processing and switching logic 458 may vary, depending on the complexity of the operations and/or the speed at which the operations are to be performed. For example, the packet processing and switching logic 458 can include processors ranging from microcontrollers to multi-core processors. A variety of types or architectures of multi-core processors may also be used. Additionally, a portion of the packet processing operations may be implemented by embedded hardware logic.
The core switch logic 480 includes a crossbar 482, memory 470, a subnet management agent (SMA 476), and packet processing and switching logic 478. The crossbar 482 is a non-blocking low latency crossbar that interconnects the multi-port IB interfaces 460A-460D and connects with the memory 470. The memory 470 includes receive queues 472 and transmit queues 474. In one embodiment, packets to be switched between the multi-port IB interfaces 460A-460D can be received by the crossbar 482, stored in one of the receive queues 472, processed by the packet processing and switching logic 478, and stored in one of the transmit queues 474 for transmission to the outbound multi-port IB interface. In implementations that do not use the multi-port IB interfaces 460A-460D, the core switch logic 480 and crossbar 482 switch packets directly between I/O buffers associated with the port and the receive queues 472 and transmit queues 474 within the memory 470.
The packet processing and switching logic 478 includes programmable functionality and can execute data plane programs via a variety of types or architectures of multi-core processors. The packet processing and switching logic 478 is representative of the applicable circuitry and logic for implementing switching operations, as well as packet processing operations beyond those that may be performed at the ports themselves. Processing elements of the packet processing and switching logic 478 execute software and/or firmware instructions configured to implement packet processing and switch operations. Such software and/or firmware may be stored in non-volatile storage on the switch itself. The software may also be downloaded or updated over a network in conjunction with initializing operations of the InfiniBand switch 450.
The SMA 476 is configurable to manage, monitor, and control functionality of the InfiniBand switch 450. The SMA 476 is also an agent of, and in communication with, the subnet manager (SM) for the subnet associated with the InfiniBand switch 450. The SM is the entity that discovers the devices within the subnet and performs a periodic sweep of the subnet to detect changes to the subnet's topology. One SMA within a subnet can be elected the primary SMA for the subnet and act as the SM. Other SMAs within the subnet will then communicate with that SMA. Alternatively, the SMA 476 can operate with other SMAs in the subnet to act as a distributed SM. In some embodiments, SMA 476 includes or executes on standalone circuitry and logic, such as a microcontroller, single core processor, or multi-core processor. In other embodiments, SMA 476 is implemented via software and/or firmware instructions executed on a processor core or other processing element that is part of a processor or other processing element used to implement packet processing and switching logic 478.
Embodiments are not specifically limited to implementations including multi-port IB interfaces 460A-460D. In one embodiment, the port is associated with its own receive and transmit buffers, with the crossbar 482 being configured to interconnect those buffers with receive queues 472 and transmit queues 474 in the memory 470. Packet processing and switching is then primarily performed by the packet processing and switching logic 478 of the core switch logic 480.
As shown in
In various configurations, the network interface device 500 is configurable to interface with networks including but not limited to Ethernet, including Ultra Ethernet. However, the network interface device 500 may also be configured as an InfiniBand or NVLink interface via the modification of various components. For example, the transceiver 502 can be capable of receiving and transmitting packets in conformance with the InfiniBand, Ethernet, or NVLink protocols. Other protocols may also be used. The transceiver 502 can receive and transmit packets from and to a network via a network medium. The transceiver 502 can include PHY circuitry 514 and media access control circuitry (MAC circuitry 516). PHY circuitry 514 can include encoding and decoding circuitry to encode and decode data packets according to applicable physical layer specifications or standards. MAC circuitry 516 can be configured to assemble data to be transmitted into packets that include destination and source addresses along with network control information and error detection hash values.
The SoC/SIP 545 can include processors that may be any combination of a CPU processor, graphics processing unit (GPU), field programmable gate array (FPGA), application specific integrated circuit (ASIC), or other hardware devices configurable for instruction execution. For example, a smart network interface can provide packet processing capabilities in the network interface using processors 505. Operation of the processors 505, including programmable data plane processors, can be programmed using Programming Protocol-independent Packet Processors (P4), C, Python, Broadcom Network Programming Language (NPL), x86 or ARM compatible executable binaries, or other executable binaries.
The packet allocator 524 can provide distribution of received packets for processing by multiple CPUs or cores using timeslot allocation. An interrupt coalesce circuit 522 can perform interrupt moderation in which the interrupt coalesce circuit 522 waits for multiple packets to arrive, or for a time-out to expire, before generating an interrupt to the host system to process the received packet(s). Receive Segment Coalescing (RSC) can be performed by the network interface device 500, in which portions of incoming packets are combined into segments of a packet. The network interface device 500 can then provide this coalesced packet to an application. A DMA engine 526 can copy a packet header, packet payload, and/or descriptor directly from host memory to the network interface or vice versa, instead of copying the packet to an intermediate buffer at the host and then using another copy operation from the intermediate buffer to the destination buffer. The memory 510 can be any type of volatile or non-volatile memory device and can store any queue or instructions used to program the network interface device 500. The transmit queue 507 can include data or references to data for transmission by the network interface. The receive queue 508 can include data or references to data that was received by the network interface from a network. The descriptor queues 520 can include descriptors that reference data or packets in the transmit queue 507 or the receive queue 508. The bus interface 512 can provide an interface with the host device. For example, the bus interface 512 can be compatible with PCI Express, although other interconnection standards may be used.
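The interrupt moderation behavior attributed to the interrupt coalesce circuit 522 can be modeled in software as a small state machine. The sketch below is illustrative only: the class name InterruptCoalescer, its parameters, and the use of caller-supplied timestamps are assumptions, and an actual circuit would implement the equivalent logic in hardware with a hardware timer.

```python
class InterruptCoalescer:
    """Batch packets until a count threshold or a time-out is reached."""

    def __init__(self, max_packets, timeout):
        self.max_packets = max_packets  # fire once this many packets are pending
        self.timeout = timeout          # fire once the oldest packet is this old
        self.pending = []
        self.first_arrival = None

    def on_packet(self, packet, now):
        """Record a packet; return a batch if an interrupt should be raised."""
        if not self.pending:
            self.first_arrival = now
        self.pending.append(packet)
        return self._maybe_fire(now)

    def on_tick(self, now):
        """Periodic timer check; return a batch if the time-out has expired."""
        return self._maybe_fire(now)

    def _maybe_fire(self, now):
        if not self.pending:
            return None
        full = len(self.pending) >= self.max_packets
        expired = now - self.first_arrival >= self.timeout
        if full or expired:
            batch, self.pending = self.pending, []
            self.first_arrival = None
            return batch  # stands in for "generate one interrupt for this batch"
        return None


coalescer = InterruptCoalescer(max_packets=3, timeout=10.0)
r1 = coalescer.on_packet("a", now=0.0)
r2 = coalescer.on_packet("b", now=1.0)
r3 = coalescer.on_packet("c", now=2.0)   # count threshold reached
r4 = coalescer.on_packet("d", now=5.0)
r5 = coalescer.on_tick(now=14.0)         # 9 time units old, still waiting
r6 = coalescer.on_tick(now=15.0)         # time-out reached
```

The trade-off is the usual one for interrupt moderation: larger batches reduce the interrupt rate seen by the host at the cost of added latency for the first packet in each batch.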
As shown in
While the illustrated implementation of the network interface device 550 is shown as having a PCIe interface 558, other implementations can use other interfaces. For example, the network interface device 550 may use an Open Compute Project (OCP) mezzanine connector. Additionally, the PCIe interface 558 may also be configured with a multi-host solution that enables multiple compute or storage hosts to couple with the network interface device 550. The PCIe interface 558 may also support technology that enables direct PCIe access to multiple CPU sockets, which eliminates the network traffic having to traverse the inter-processor bus of a multi-socket server motherboard for a server that includes the network interface device 550.
The network interface device 550 implements endpoint elements of the InfiniBand architecture, which is based around queue pairs and RDMA. InfiniBand off-loads traffic control from software through the use of execution queues (e.g., work queues), which are initiated by a software client and managed in hardware. The communication endpoint includes a queue pair (QP) having a send queue and a receive queue. A QP is a memory-based abstraction in which communication is achieved through memory-to-memory transfers between applications or between applications and devices. Communication to QPs occurs through virtual lanes of the network ports 552A-552B, which enable multiple independent data flows to share the same link, with separate buffering and flow control for each flow.
Communication occurs via channel I/O, in which a virtual channel directly connects two applications that exist in separate address spaces. The hardware transport engine 560 includes hardware logic to perform transport level operations via the QP for an endpoint. The RDMA engine 562 leverages the hardware transport engine 560 to perform RDMA operations between endpoints. The RDMA engine 562 implements RDMA operations in hardware and enables an application to read and write the memory of a remote system without OS kernel intervention or unnecessary data copies by allowing one endpoint of a communication channel to place information directly into the memory of another endpoint. The virtual endpoint logic 564 manages the operation of a virtual endpoint for channel I/O, which is a virtual instance of a QP that will be used by an application. The virtual endpoint logic 564 maps the QPs into the virtual address space of an application associated with a virtual endpoint.
Congestion control logic 563 performs operations to mitigate the occurrence of congestion on a channel. In various implementations, the congestion control logic 563 can perform flow control over a channel to limit congestion at the destination of a data transfer. The congestion control logic 563 can perform link level flow control to manage source congestion at virtual links of the network ports 552A-552B. In some implementations, the congestion control logic 563 can take steps to limit congestion at intermediate points (e.g., IB switches) along a channel.
Offload engines 566 enable the offload of network tasks that may otherwise be performed in software to the network interface device 550. The offload engines 566 can support offload of operations including but not limited to offload of receive side scaling from a device driver or stateless network operations, for example, for TCP implementations over InfiniBand, such as TCP/UDP/IP stateless offload or VXLAN offload. The offload engines 566 can also implement operations of an interrupt coalesce circuit 522 of the network interface device 500 of
The QoS logic 568 can perform QoS operations, including QoS functionality that is inherent within the basic service delivery mechanism of InfiniBand. The QoS logic 568 can also implement enhanced InfiniBand QoS, such as fine grained end-to-end QoS. The QoS logic 568 can implement queuing services and management for prioritizing flows and guaranteeing service levels or bandwidth according to flow priority. For example, the QoS logic 568 can configure virtual lane arbitration for virtual lanes of the network ports 552A-552B according to flow priority. The QoS logic 568 can also operate in concert with the congestion control logic 563.
The GSA/SMA logic 569 implements general services agent (GSA) operations to manage the network interface device 550 and the InfiniBand fabric, as well as performing subnet management agent operations. The GSA operations include device-specific management tasks, such as querying device attributes, configuring device settings, and controlling device behavior. The GSA/SMA logic 569 can also implement SMA operations, including a subset of the operations performed by the SMA 476 of the InfiniBand switch 450 of
The management interface 570 provides support for a hardware interface to perform out-of-band management of the network interface device 550, such as an interconnect to a board management controller (BMC) or a hardware debug interface.
In one embodiment, access to remote storage containing model data can be accelerated by the programmable network interface 600. For example, the programmable network interface 600 can be configured to present remote storage devices as local storage devices to the host system. The programmable network interface 600 can also accelerate RDMA operations performed between GPUs of the host system and GPUs of remote systems. In one embodiment, the programmable network interface 600 can enable storage functionality such as, but not limited to, NVMe-oF. The programmable network interface 600 can also accelerate encryption, data integrity, compression, and other operations for remote storage on behalf of the host system, allowing remote storage to approach the latencies of storage devices that are directly attached to the host system.
The programmable network interface 600 can also perform resource allocation and management on behalf of the host system. Storage security operations can be offloaded to the programmable network interface 600 and performed in concert with the allocation and management of remote storage resources. Network-based operations to manage access to the remote storage that would otherwise be performed by a processor of the host system can instead be performed by the programmable network interface 600.
In one embodiment, network and/or data security operations can be offloaded from the host system to the programmable network interface 600. Data center security policies for a data center node can be handled by the programmable network interface 600 instead of the processors of the host system. For example, the programmable network interface 600 can detect and mitigate against an attempted network-based attack (e.g., DDoS) on the host system, preventing the attack from compromising the availability of the host system.
The programmable network interface 600 can include a system on a chip (SoC/SIP 620) that executes an operating system via multiple processor cores 622. The processor cores 622 can include general-purpose processor (e.g., CPU) cores. In one embodiment the processor cores 622 can also include one or more GPU cores. The SoC/SIP 620 can execute instructions stored in a memory device 640. A storage device 650 can store local operating system data. The storage device 650 and memory device 640 can also be used to cache remote data for the host system. Network ports 660A-660B enable a connection to a network or fabric and facilitate network access for the SoC/SIP 620 and, via the host interface 670, for the host system. In one configuration, a first network port 660A can connect to a first forwarding element, while a second network port 660B can connect to a second forwarding element. Alternatively, both network ports 660A-660B can be connected to a single forwarding element using a link aggregation protocol (LAG). The programmable network interface 600 can also include an I/O interface 675, such as a USB interface. The I/O interface 675 can be used to couple external devices to the programmable network interface 600 or as a debug interface. The programmable network interface 600 also includes a management interface 630 that enables software on the host device to manage and configure the programmable network interface 600 and/or SoC/SIP 620. In one embodiment the programmable network interface 600 may also include one or more accelerators or GPUs 645 to accept offload of parallel compute tasks from the SoC/SIP 620, host system, or remote systems coupled via the network ports 660A-660B. For example, the programmable network interface 600 can be configured with a graphics processor and participate in general-purpose or graphics compute operations in a datacenter environment.
One or more aspects may be implemented by representative code stored on a machine-readable medium which represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions which represent various logic within the processor. When read by a machine, the instructions may cause the machine to fabricate the logic to perform the techniques described herein. Such representations, known as “IP cores,” are reusable units of logic for an integrated circuit that may be stored on a tangible, machine-readable medium as a hardware model that describes the structure of the integrated circuit. The hardware model may be supplied to various customers or manufacturing facilities, which load the hardware model on fabrication machines that manufacture the integrated circuit. The integrated circuit may be fabricated such that the circuit performs operations described in association with any of the embodiments described herein.
The RTL design 715 or equivalent may be further synthesized by the design facility into a hardware model 720, which may be in a hardware description language (HDL), or some other representation of physical design data. The HDL may be further simulated or tested to verify the IP core design. The IP core design can be stored for delivery to a fabrication facility 765 using non-volatile memory 740 (e.g., hard disk, flash memory, or any non-volatile storage medium). The fabrication facility 765 may be a 3rd party fabrication facility. Alternatively, the IP core design may be transmitted (e.g., via the Internet) over a wired connection 750 or wireless connection 760. The fabrication facility 765 may then fabricate an integrated circuit that is based at least in part on the IP core design. The fabricated integrated circuit can be configured to perform operations in accordance with at least one embodiment described herein.
In highly virtualized environments, significant amounts of server resources are expended processing tasks that are beyond user applications. Such processing tasks can include hypervisors, container engines, network and storage functions, security, and large amounts of network traffic. To address these various processing tasks, programmable network interface devices with accelerators and network connectivity have been introduced. These programmable network interface devices are referred to as infrastructure processing units (IPUs), data processing units (DPUs), edge processing units (EPUs), advanced network interface devices, programmable packet processing devices, and so on.
The programmable network interface devices can accelerate and manage infrastructure functions using dedicated and programmable cores deployed in the devices. The programmable network interface devices can provide for infrastructure offload and an extra layer of security by serving as a control point of the host for running infrastructure applications. By using programmable network interface devices, the overhead associated with running infrastructure tasks can be offloaded from a server device.
In implementations herein, the programmable network interface devices may be referred to generally as a programmable network interface device (PNID), a network interface device, an advanced network interface device, an IPU, a DPU, an EPU, or a programmable packet processing device, for example. For the discussion herein, the programmable network interface device will be referred to in abbreviated form as PNID.
In implementations herein, the highly virtualized environments that PNIDs may operate in can introduce security concerns. Workloads can share the same platform and should be kept separate from one another. The increased use of virtualization gives rise to stringent security requirements in the areas of software integrity and workload isolation.
As part of enforcing security requirements, such as software integrity and workload isolation, PNIDs can function as a foundational Hardware Root of Trust (HROT) that can spearhead a boot sequence by utilizing an embedded Integrated Management Complex (IMC) of the PNID. Utilizing the IMC of the PNID for the boot sequence enables the PNID to start from a secure and trusted state. Once the IMC has been securely booted, the PNID can continue the trust chain using a compute complex (CC) of the PNID. This progression is utilized to maintain a secure boot sequence that can then extend to a host system (host device) that is communicably coupled to the PNID. The PNID's orchestration of the boot process can further be extended to multi-tenant environments, where stringent security and isolation are demanded. However, the dynamism of the multi-tenant environments, which may make use of virtual machines and/or containers that both utilize a root of trust, makes the management of these environments complex.
Conventional approaches to managing a secure boot sequence and providing trusted computing for multi-tenant environments have utilized a virtualized Trusted Platform Module (TPM). A TPM can be considered a HROT that enables remote attestation by digitally signing cryptographic hashes of software components. The attestation affirms that the software and/or hardware is genuine or correct. Common uses of TPMs are to verify platform integrity (to verify that the boot process starts from a trusted combination of hardware and software), and to store disk encryption keys. Hardware (discrete or physical) TPMs can be made available as TPM chips that are deployed on computing devices.
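The integrity measurement that underlies TPM-based attestation can be illustrated with the standard PCR extend operation, in which a register is never written directly but is replaced by a hash of its previous value concatenated with a new measurement digest. The sketch below uses SHA-256 and assumes a register that starts zeroed; the helper names pcr_extend and measure_boot and the component byte strings are hypothetical, and the signing step used for remote attestation is omitted.

```python
import hashlib

PCR_SIZE = hashlib.sha256().digest_size  # 32 bytes for a SHA-256 PCR bank


def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    """Return SHA-256(old PCR value || SHA-256(measurement))."""
    digest = hashlib.sha256(measurement).digest()
    return hashlib.sha256(pcr + digest).digest()


def measure_boot(components) -> bytes:
    """Fold an ordered list of boot components into a single PCR value."""
    pcr = bytes(PCR_SIZE)  # assumption: the register starts as all zeros
    for blob in components:
        pcr = pcr_extend(pcr, blob)
    return pcr


# Placeholder byte strings standing in for firmware, bootloader, and kernel images.
trusted = measure_boot([b"firmware-v1", b"bootloader-v1", b"kernel-v1"])
tampered = measure_boot([b"firmware-v1", b"bootloader-evil", b"kernel-v1"])
reordered = measure_boot([b"bootloader-v1", b"firmware-v1", b"kernel-v1"])
```

Because each value depends on every earlier measurement, a verifier that knows the expected final value can detect both a modified component and a changed boot order, which is what allows a signed PCR value to vouch for the whole boot sequence.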
In conventional approaches, virtualizing the TPM makes the discrete TPM capabilities available to virtual machines (VMs) running on a platform. In the conventional approaches, the TPM is virtualized by providing virtualized TPM instances for VMs on a single hardware platform. The virtual TPM instances can be linked to a hardware TPM that is running on the same hardware platform that the VMs are operating on. In some conventional approaches, in lieu of the discrete TPM running on the same hardware platform as the VMs, a virtual TPM manager may be provided on a secure coprocessor card on the same hardware platform as the VMs, where the virtual TPM manager provides the virtual TPM instances.
The conventional approaches do not provide for a discrete TPM co-located on the same processing complex with a virtual TPM manager instantiating the virtual TPM instances. This makes the management of the virtual TPMs for the various virtualized environments more complex and makes the management of transfers and migrations of the workloads of the virtualized environments more complex. Considering the limited number of TPMs available per computing host, the limited resources provided by the independent TPM, and the TPM being a core piece in the trusted computing framework, adoption of these TPM architecture techniques has conventionally been limited to the physical hosts running the clients (e.g., tenants).
Implementations herein address the above-noted technical problems by providing a scalable TPM in a PNID. The scalable TPM addresses the shortcomings of traditional discrete TPMs, which are limited by slower performance and the inability to efficiently handle multiple concurrent requests. Discrete TPMs also face the challenge of a limited number of Platform Configuration Registers (PCRs), which can quickly become insufficient when multiple system components demand secure operations.
Implementations herein provide trusted computing capabilities at scale by leveraging PNID resources to enable a high number of virtualized environments (e.g., VMs, containers) to interact with a TPM. Implementations herein utilize an IMC component of the PNID to manage the lifecycle of a set of virtual TPM (vTPM) instances. The vTPM instances are made available to the different virtualized environments of the infrastructure. For example, the virtualized environments may be hosted in the CC of the PNID, such as a VM or container running on processing resources of the CC, may be a VMM or container operating system (COS) running on physical host device(s) communicably coupled to the PNID, and/or may be applications running on the VMs and/or the containers.
Implementations herein provide for technical advantages. For example, by embedding TPM capabilities within a PNID, the limitations of PCR scarcity found in the conventional approaches are mitigated. The PNID can support the creation of virtual TPMs that are each provisioned with their own set of PCRs, resulting in enough resources to accommodate the security demands of the IMC, the CC, and the host system. The scalable approach to TPM resource allocation provided herein is also beneficial in multi-tenant environments, where separate runtime integrity attestation is provided for the respective operating system within the different VMs and/or containers. Consequently, the PNID-based TPM model discussed herein enhances the dynamic and secure distribution of TPM resources, allowing the respective tenant to uphold a distinct and secure environment. Also, migration of virtualized environments from device to device is enabled by transferring their related security data (e.g., PCRs) in a secure and scalable way.
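The idea of a manager that creates vTPM instances, each with its own set of PCRs, and that can hand an instance's state to another device can be sketched as follows. This is a hypothetical model for illustration, not the implementation described herein: the class names VTpmInstance and VTpmManager, the choice of 24 PCRs per instance, and the plain dictionary used for exported state are all assumptions. A real design would encrypt and authenticate the state in transit, which the sketch omits.

```python
import hashlib


class VTpmInstance:
    """One virtual TPM with a private bank of PCRs (SHA-256 assumed)."""

    def __init__(self, owner, num_pcrs=24):  # 24 PCRs is an assumed bank size
        self.owner = owner
        self.pcrs = [bytes(32) for _ in range(num_pcrs)]

    def extend(self, index, measurement):
        digest = hashlib.sha256(measurement).digest()
        self.pcrs[index] = hashlib.sha256(self.pcrs[index] + digest).digest()


class VTpmManager:
    """Hypothetical stand-in for the lifecycle management attributed to the IMC."""

    def __init__(self):
        self._instances = {}

    def create(self, owner):
        if owner in self._instances:
            raise ValueError(f"vTPM already exists for {owner}")
        instance = VTpmInstance(owner)
        self._instances[owner] = instance
        return instance

    def destroy(self, owner):
        del self._instances[owner]

    def export_state(self, owner):
        """Detach an instance and return its PCR values for migration."""
        instance = self._instances.pop(owner)
        return {"owner": owner, "pcrs": list(instance.pcrs)}

    def import_state(self, state):
        """Recreate an instance from exported state on the receiving device."""
        instance = self.create(state["owner"])
        instance.pcrs = list(state["pcrs"])
        return instance


source_pnid = VTpmManager()
target_pnid = VTpmManager()

vm_a = source_pnid.create("tenant-a-vm")
vm_b = source_pnid.create("tenant-b-vm")
vm_a.extend(0, b"tenant-a-kernel")

# Migrate tenant A's security data to another device.
migrated = target_pnid.import_state(source_pnid.export_state("tenant-a-vm"))
```

In this model, measurements recorded by one tenant never touch another tenant's registers, and the migrated instance carries the same PCR values it had on the source device.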
Further details on the implementations of providing scalable TPM in programmable network interface devices are described below with respect to
The elements of
In various embodiments, components of computing environment 800 (including requesting, target, and/or consuming devices) may be coupled together through one or more networks (e.g., network) comprising any number of intervening network nodes, such as routers, switches, or other computing devices. The network, the requesting device, and/or the target device may be part of any suitable network topography, such as a data center network, a wide area network, a local area network, an edge network, or an enterprise network.
The storage command may be communicated from the requesting device to the target device and/or data read responsive to a storage command may be communicated from the target device to the consuming device over any suitable communication protocol (or multiple protocols), such as peripheral component interconnect (PCI), PCI Express (PCIe), CXL, Universal Serial Bus (USB), Serial Attached SCSI (SAS), Serial ATA (SATA), InfiniBand, Fibre Channel (FC), IEEE 802.3, IEEE 802.11, Ultra Ethernet, or other current or future signaling protocol. The storage command may include, but is not limited to, commands to write data, read data, and/or erase data, for example. In particular embodiments, the storage commands conform with a logical device interface specification (also referred to herein as a network communication protocol) such as Non-Volatile Memory Express (NVMe) or Advanced Host Controller Interface (AHCI), for example.
A computing platform, such as computing environment 800, may include one or more requesting devices, consuming devices, and/or target devices. Such devices may comprise one or more processing units (e.g., processing units 845) to generate a storage command, decode and process a storage command, and/or consume (e.g., process) data requested by a storage command. As used herein, the terms “processor unit”, “processing unit”, “processor”, or “processing element”, may refer to any device or portion of a device that processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory.
A processing unit may include one or more digital signal processors (DSPs), application-specific integrated circuits (ASICs), central processing units (CPUs), graphics processing units (GPUs), general-purpose GPUs (GPGPUs), accelerated processing units (APUs), field-programmable gate arrays (FPGAs), neural network processing units (NPUs), edge processing units (EPUs), vector processing units, software defined processing units, video processing units, data processor units (DPUs), memory processing units, storage processing units, accelerators (e.g., graphics accelerator, compression accelerator, artificial intelligence accelerator, networking accelerator), controllers, cryptoprocessors (specialized processors that execute cryptographic algorithms within hardware), server processors, I/O controllers, NICs (e.g., SmartNICs), infrastructure processing units (IPUs), microcode engines, memory controllers (e.g., cache controllers, host memory controllers, DRAM controllers, SSD controllers, hard disk drive (HDD) controllers, nonvolatile memory controllers, etc.), or any other suitable type of processor units. As such, a processor unit may be referred to as an XPU.
Components of computing environment 800 may have any suitable characteristics of similar components of those described with respect to
In some embodiments, computing environment 800 may be a data center or other similar environment, where any combination of the components may be placed together in a rack or shared in a data center pod. In various embodiments, computing environment 800 may represent a telecom environment, in which any combination of the components may be enclosed together in curb/street furniture or an enterprise wiring closet.
In some embodiments, orchestrator 810 may function as a requesting device and send storage commands as described herein to storage devices 820A-C functioning as target devices. Some of these commands may read data that is then supplied to processing units 845 that are functioning as consuming devices. In some embodiments, a processing unit 845 or a PNID 850 may function as the requesting device. Thus, a processing unit 845 could be both a requesting device and the consuming device. In one implementation, PNID 850 may be the same as network interface device 500 of
In one configuration, the PNID 900 can include a network interface 910, memory 912, storage 914, an accelerator/GPU 916, a host interface 915, an IMC 920, and a CC 940. The IMC 920 can provide a dedicated management complex for the PNID 900, where the IMC 920 includes one or more processors and subsystems to provide secure boot, maintenance, and upgrades. In some implementations, the CC 940 utilizes processors to implement smart network interface device functionality. For example, the processors may include accelerators for various accelerated functionality, such as NVMe-oF or RDMA. The specific makeup of the PNID 900 depends on the protocol implemented via the PNID 900.
In various configurations, the PNID 900 is configurable to interface with networks including but not limited to InfiniBand, Ethernet, or NVLink. The CC 940 can include processors that may be any combination of a CPU, graphics processing unit (GPU), field programmable gate array (FPGA), application specific integrated circuit (ASIC), or other programmable hardware devices that allow programming of PNID 900. For example, a smart network interface can provide packet processing capabilities in the network interface using processors. Configuration of operation of CC 940, including programmable data plane processors, can be programmed using Programming Protocol-independent Packet Processors (P4), C, Python, Broadcom Network Programming Language (NPL), x86, or ARM compatible executable binaries or other executable binaries.
In some implementations, PNID 900 may be communicably coupled (e.g., over a network) to a host device, such as host device 960, via host interface 915. In one implementation, the host device 960 may be the same as one of processing units 845 described with respect to
As previously discussed, PNIDs, including PNID 900, can provide a scalable TPM, where the scalable TPM addresses the shortcomings of traditional discrete TPMs, which are limited by slower performance and the inability to efficiently handle multiple concurrent requests. As previously noted, discrete TPMs also face the challenge of a limited number of PCRs, which can quickly become insufficient when multiple system components demand secure operations. Implementations herein provide trusted computing capabilities at scale by leveraging PNID resources to enable a high number of virtualized environments (e.g., VMs, containers) to interact with a TPM.
Implementations herein utilize an IMC 920 of the PNID 900 to manage the lifecycle of a set of virtual TPM (vTPM) instances 935A, 935B, 935C (collectively referred to herein as vTPM instances 935). The vTPM instances 935 are made available to the different virtualized environments of the infrastructure. For example, the vTPM instances 935 can be made available to a VMM 950 (or any other virtualized environment orchestrator) running on CC 940 of the PNID 900, to a VMM 970 or container operating system (COS) 980 running on physical host device 960 communicably coupled to the PNID 900, and/or to applications running on the VMs 955A, 955B, 975A, 975B and/or the containers 985A, 985B.
Implementations herein provide a variety of techniques to support instantiating, exposing, and/or migrating the vTPM instances 935. Implementations incorporate the PNID 900 (e.g., the IMC 920) as the entry point for out-of-band (OOB) management and hardware root of trust (HROT). Implementations also can instantiate a vTPM manager 930 in the PNID IMC 920. The IMC 920 may include a discrete (e.g., hardware) TPM 925, which may be independent from a host TPM (if available, not shown), and is managed by the IMC 920. This discrete TPM 925 is linked to the vTPM manager 930 and enables the vTPM manager 930 component to run in a secure and isolated environment.
Implementations herein further expose at least two isolated ports, port 0 932A and port 1 932B, from the IMC 920 to allow communications to the vTPM manager 930. One port (e.g., port 1 932B) is made available to the CC 940, and the other port (e.g., port 0 932A) is made available to the host device 960. This enables isolation between infrastructure software that runs in the PNID CC 940 and the software running in the host device 960.
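The port isolation described above can be illustrated with a minimal sketch. The class `VtpmManager`, its method names, and the string-based command format are hypothetical and chosen only for illustration; the sketch assumes that each vTPM instance is bound to the port through which it was created and that requests arriving on the other port are refused.

```python
class VtpmManager:
    """Toy model of a vTPM manager reachable through two isolated ports."""

    HOST_PORT = 0  # corresponds to port 0 932A (host device side)
    CC_PORT = 1    # corresponds to port 1 932B (compute complex side)

    def __init__(self):
        # Maps each vTPM instance identifier to the port that owns it.
        self._owner = {}

    def create_instance(self, port, instance_id):
        if port not in (self.HOST_PORT, self.CC_PORT):
            raise ValueError("unknown port")
        if instance_id in self._owner:
            raise ValueError("instance already exists")
        self._owner[instance_id] = port

    def submit(self, port, instance_id, command):
        # Multiplex the request to its instance, but only if it arrived
        # on the port that owns that instance.
        if self._owner.get(instance_id) != port:
            raise PermissionError("instance is not reachable from this port")
        return f"{instance_id}:{command}"


manager = VtpmManager()
manager.create_instance(VtpmManager.HOST_PORT, "vm-975A")
manager.create_instance(VtpmManager.CC_PORT, "vm-955A")
print(manager.submit(VtpmManager.HOST_PORT, "vm-975A", "read-pcr"))
```

In this sketch, software on the host side cannot address an instance created through the CC port, and vice versa, which mirrors the isolation between infrastructure software and host software.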
Implementations herein further provide for mitigating the challenges of workload (e.g., VMs 955A, 955B, 975A, 975B, containers 985A, 985B, native applications) migration between host devices (e.g., host device 960 and another host device (not shown)). Implementations may restrict the transfer of sensitive vTPM data (e.g., data hosted in PCRs of IMC 920) and state using secure and trusted PNID-to-PNID communication channels established between the PNIDs corresponding to the host devices involved in the migration process. To reduce latency on the migration, implementations provide a mechanism to proxy TPM commands between the target (destination) PNID and the host (source) PNID. This proxy mechanism can remain in place until the vTPM state migration is completed and the target PNID can start servicing the migrated workload.
As previously noted, implementations herein provide for the instantiation of the vTPM manager 930 in the IMC 920 of the PNID 900. The IMC 920 includes a discrete TPM 925 (e.g., TPM hardware chip) that is independent of any TPM chips that may be available to the host device 960. The discrete TPM 925 enables firmware or an OS of the IMC 920 to establish a secure boot mechanism. This secure boot mechanism can control and confirm that the components in the boot chain are measured and verified before handing control to those components. The TPM 925 of the IMC 920 can be utilized as the storage for those measurements. Once the OS of the IMC 920 completes the booting process, it takes ownership of the TPM 925. In one implementation, the vTPM manager 930 can be included in the OS of the IMC 920 as a system service that is instantiated by the IMC 920. The vTPM manager 930 can perform functions such as creating vTPM instances 935 and multiplexing requests from clients to their associated vTPM instances 935.
The TPM specification stipulates that a TPM 925 establish a storage root key (SRK) as the root key for its key hierarchy. Each generated key has its private key encrypted by its parent key, which creates a chain to the SRK. In the vTPM manager 930 of implementations herein, an independent key hierarchy can be created per vTPM instance 935. This allows for the vTPM instance 935 to be unlinked from the key hierarchy of the hardware TPM 925. One advantage is that key generation is faster because it does not rely on the hardware TPM 925. It can also simplify vTPM instance 935 migration.
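The measure-then-verify boot chain can be sketched as follows. This is an illustrative model only: the stage names, the allow-list of expected digests, and the use of a single SHA-256 register are assumptions, and the `extend` function follows the usual PCR extend rule in which the new register value is the hash of the old value concatenated with the measurement.

```python
import hashlib


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """PCR extend: new value is the hash of the old value plus the measurement."""
    return hashlib.sha256(pcr + measurement).digest()


def measured_boot(stages, expected):
    """Measure each boot stage, verify it, and only then 'hand over control'.

    stages:   list of (name, image_bytes) in boot order (hypothetical names)
    expected: dict mapping stage name to its known-good SHA-256 digest
    """
    pcr = bytes(32)  # the register starts at all zeros
    for name, image in stages:
        digest = hashlib.sha256(image).digest()
        pcr = extend(pcr, digest)           # record the measurement
        if expected.get(name) != digest:    # verify before handing control
            raise RuntimeError(f"boot stage {name!r} failed verification")
        # A real boot chain would transfer control to the stage here.
    return pcr


stages = [("firmware", b"fw-image"), ("imc-os", b"os-image")]
expected = {name: hashlib.sha256(image).digest() for name, image in stages}
final_pcr = measured_boot(stages, expected)
print(final_pcr.hex())
```

Because each extend folds the previous value into the next, the final register value depends on every stage and on the order in which the stages were measured.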
Similarly, the vTPM manager 930 can generate an endorsement key (EK) per vTPM instance 935. This enables TPM commands that rely on decrypting information with the private part of the EK to also work after a vTPM instance 935 has migrated to another PNID. In implementations herein, if the SRK, EK, or any other persistent data of vTPM instances 935 are written into persistent memory, they are encrypted with a symmetric key rooted in the hardware TPM 925 by, for example, sealing it to the state of the hardware TPM's 925 PCRs during machine boot.
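The idea of sealing persistent vTPM data to the boot-time PCR state can be illustrated with a toy model. Everything here is an assumption made for illustration: the `root_secret` stands in for a secret held by the hardware TPM, and the SHA-256 keystream with an HMAC tag is a stand-in for a proper authenticated cipher, not a production construction and not how a TPM implements sealing internally.

```python
import hashlib
import hmac
from typing import Sequence


def sealing_key(root_secret: bytes, pcrs: Sequence[bytes]) -> bytes:
    """Derive a symmetric key that depends on the current PCR values."""
    pcr_digest = hashlib.sha256(b"".join(pcrs)).digest()
    return hmac.new(root_secret, pcr_digest, hashlib.sha256).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key and counter (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def seal(root_secret: bytes, pcrs: Sequence[bytes], plaintext: bytes) -> bytes:
    key = sealing_key(root_secret, pcrs)
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))
    tag = hmac.new(key, ciphertext, hashlib.sha256).digest()
    return ciphertext + tag


def unseal(root_secret: bytes, pcrs: Sequence[bytes], blob: bytes) -> bytes:
    key = sealing_key(root_secret, pcrs)
    ciphertext, tag = blob[:-32], blob[-32:]
    expected_tag = hmac.new(key, ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected_tag):
        raise ValueError("PCR state differs from the state at sealing time")
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, len(ciphertext))))


boot_pcrs = [bytes(32), b"\x01" * 32]
blob = seal(b"hw-tpm-secret", boot_pcrs, b"vTPM SRK and EK material")
print(unseal(b"hw-tpm-secret", boot_pcrs, blob))
```

If any PCR value differs at unseal time (for example, because a boot component changed), the derived key differs and the tag check fails, so the persisted vTPM data cannot be recovered.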
In some implementations, a challenger should establish trust in an environment that includes more than the content of the VM 955A, 955B, 975A, 975B or container 985A, 985B. Therefore, attestation support within the virtualized environment should not only allow a challenger to learn about measurements inside the VM or container, but also about those of the environment that provides vTPM functionality. In one implementation, the attestation support provided herein allows for this by merging the PNID 900 environment with that of the VM 955A, 955B, 975A, 975B or container 985A, 985B by providing at least two different views of PCR registers. In one example, in the first view the lower set of PCR registers of a vTPM show the values of the hardware TPM 925 and in the second view the upper set of PCR registers reflect the values specific to that vTPM instance 935. In this way, a challenger can see all relevant measurements.
As previously noted, implementations herein provide for the usage of vTPM instances 935 for the PNID CC 940 component. In some implementations, to enable the CC 940 of the PNID 900 to make use of the vTPM capabilities provided by the IMC 920 of the PNID 900, a port, such as port 1 932B, is opened in the IMC 920 for the use of the CC 940. An OS running in the CC 940 can then communicate with the vTPM manager 930 in the IMC 920 by opening a connection to this port 1 932B. This capability may be made available to software running in VMs or containers hosted by the CC 940 by exposing it in the expected device mapper (e.g., /dev/tpm0 990A) into the target VM or container. As a result, software can interact with the vTPM instance 935 in a transparent way by issuing TPM commands via a communication facility, such as TPM Command Transmission Interface (TCTI), Public-Key Cryptography Standards #11 (PKCS11), or similar.
As previously noted, implementations herein provide for usage of vTPM instances 935 for the host device 960. In one implementation, to allow the host device 960 to make use of the vTPM capabilities provided by the IMC 920 of the PNID 900, another port, such as port 0 932A, is opened in the IMC 920 for use of the host device 960, independent of the port (port 1 932B) exposed to the CC 940 of the PNID 900. An OS of the host device 960 can then communicate with the vTPM manager 930 in the IMC 920 by opening connections to this port 0 932A, and make the capability available to software running in the VMs 975A, 975B or containers 985A, 985B running on the host device 960 by exposing it in the expected device mapper (e.g., /dev/tpm0 990B, 990C) into the target VM 975A, 975B or container 985A, 985B. As a result, software can interact with the vTPM instance 935 in a transparent way by issuing TPM commands via a communication facility, such as TCTI, PKCS11, or similar.
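The two PCR views can be pictured with a short sketch. The total of 24 registers and the split point of 16 are assumptions chosen for illustration, as is the function name `attestation_view`; the description above only states that a lower set reflects the hardware TPM and an upper set reflects the vTPM instance.

```python
NUM_PCRS = 24   # assumed register count
SPLIT = 16      # hypothetical boundary between the lower and upper sets


def attestation_view(hw_pcrs, vtpm_pcrs, split=SPLIT):
    """Merge hardware TPM and vTPM instance registers into one view.

    Indices below `split` come from the hardware TPM (the PNID environment);
    indices at or above `split` come from the vTPM instance (the VM or container).
    """
    if len(hw_pcrs) != len(vtpm_pcrs):
        raise ValueError("both register banks must have the same size")
    return list(hw_pcrs[:split]) + list(vtpm_pcrs[split:])


hw_pcrs = [bytes([i]) * 32 for i in range(NUM_PCRS)]
vtpm_pcrs = [bytes([100 + i]) * 32 for i in range(NUM_PCRS)]
view = attestation_view(hw_pcrs, vtpm_pcrs)
print(view[0] == hw_pcrs[0], view[SPLIT] == vtpm_pcrs[SPLIT])
```

A challenger reading this merged view obtains the measurements of the environment providing the vTPM together with those of the workload itself.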
In some implementations, third-party applications (apps) can also be managed in a secure way by utilizing runtime integrity attestation that is based on the TPM provided for the hosting OS. Application publishers should provide hashes to validate the integrity of the app (e.g., a golden image). An owner of an app repo/marketplace can validate the app and confirm the hash. Then, a runtime integrity attestation service can continuously verify the integrity of workloads. This can be done from a centralized place, or by requesting the one or more PNIDs 900 to validate this (e.g., having the IMC 920 maintain a storage component to keep track of workload hashes).
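A minimal sketch of the hash-tracking component mentioned above follows. The class name `WorkloadHashStore`, the use of SHA-256, and the in-memory dictionary are assumptions for illustration; the description does not specify a hash algorithm or storage format.

```python
import hashlib
import hmac


class WorkloadHashStore:
    """Keeps publisher-provided golden hashes and checks workloads against them."""

    def __init__(self):
        self._golden = {}  # app name -> expected SHA-256 hex digest

    def register(self, app: str, golden_hash_hex: str) -> None:
        # Called once the repository or marketplace owner has confirmed the hash.
        self._golden[app] = golden_hash_hex.lower()

    def verify(self, app: str, image: bytes) -> bool:
        # Returns True only if the app is known and its current image matches.
        golden = self._golden.get(app)
        if golden is None:
            return False
        actual = hashlib.sha256(image).hexdigest()
        return hmac.compare_digest(golden, actual)


store = WorkloadHashStore()
store.register("telemetry-agent", hashlib.sha256(b"agent-v1").hexdigest())
print(store.verify("telemetry-agent", b"agent-v1"))   # matches the golden image
print(store.verify("telemetry-agent", b"agent-v1x"))  # modified image
```

A runtime integrity attestation service could call `verify` periodically, either centrally or on each PNID, to detect workloads that no longer match their golden image.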
As previously discussed, the PNID 900 can provide discovery and/or enablement for scalable TPMs, such as the scalable TPMs implemented in PNIDs as described herein. In some implementations, the PNID 900 can provide an API through which the scalable TPM capability as described herein can be implemented. In some implementations, the API can query whether a computing system supports the scalable TPM capabilities. For example, the API can query whether the scalable TPM functionality as described herein is provided by the PNID 900. In some implementations, the API can enable or disable such a capability. For example, the API may be configured to enable and/or disable the scalable TPM functionality in the PNID 900.
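One possible shape for such a discovery and enablement API is sketched below. The class and method names are hypothetical; the description does not define a concrete interface, so this only illustrates the query, enable, and disable operations.

```python
class ScalableTpmCapability:
    """Hypothetical discovery and enablement interface for a PNID."""

    def __init__(self, supported: bool):
        self._supported = supported
        self._enabled = False

    def is_supported(self) -> bool:
        # Query: does this PNID provide the scalable TPM functionality?
        return self._supported

    def is_enabled(self) -> bool:
        return self._enabled

    def enable(self) -> None:
        if not self._supported:
            raise RuntimeError("scalable TPM is not supported on this device")
        self._enabled = True

    def disable(self) -> None:
        self._enabled = False


capability = ScalableTpmCapability(supported=True)
if capability.is_supported():
    capability.enable()
print(capability.is_enabled())
```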
As previously noted, implementations herein provide for workload migration support via the scalable TPM in a PNID. In order to support workload migration between systems, the vTPM instances 935 running on the PNID IMC 920 of the PNID 900 corresponding to the source host device 960 should be migrated to an IMC of a PNID associated with a destination host device. In some implementations, this can also apply for CC virtual environment migration.
As the vTPM migration process takes time to complete, the destination PNID IMC 1005 can forward 1020, 1025 TPM commands to the source PNID IMC 1002 over a trusted channel previously established via a PNID-to-PNID handshaking protocol. The vTPM migration process then proceeds with the destination vTPM manager 1006 providing 1030 a nonce to the destination PNID IMC 1005, which then exports 1035 the nonce to the source PNID IMC 1002 corresponding to the source host 1003. The source vTPM manager 1001 is locked 1040 to the same nonce. The source vTPM manager 1001 and the source PNID IMC 1002 then coordinate to obtain 1045 an instance key, package 1050 the vTPM state, and delete 1055 the vTPM instance. The vTPM state is also exported 1060 by the source PNID IMC 1002 with the nonce to the destination PNID IMC 1005. The destination PNID IMC 1005 validates the nonce before importing the vTPM state to the destination vTPM manager 1006. Once all the state has been transferred to the destination vTPM manager 1006, the destination vTPM manager 1006 can coordinate with the destination PNID IMC 1005 to set 1065 the instance key, unpackage 1070 the vTPM instance state, and unlock 1075 the vTPM instance.
Furthermore, command forwarding to the source vTPM via the source PNID IMC 1002 is stopped, and TPM commands can now be serviced by the destination vTPM manager 1006. In implementations herein, the command forwarding capabilities can allow for live migration support with zero downtime that would otherwise result from the latency of transferring TPM state. In addition, the PNID-to-PNID communication allows an increased level of security for the migration process because the PNID IMCs can attest to one another, so the exchanged data is secured.
Method 1100 begins at processing block 1110 where the PNID may host a discrete trusted platform module (dTPM) in an IMC of the PNID. In one implementation, the dTPM includes a hardware secure cryptoprocessor. Then, at block 1120, the PNID may enable, via the dTPM, the IMC of the PNID to establish a secure boot mechanism for the programmable network interface device.
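The migration sequence can be summarized with a simplified sketch. The classes `SourceManager` and `DestinationManager` and their methods are hypothetical, the trusted PNID-to-PNID channel is modeled as a direct object reference, and the instance key handling and state packaging steps are reduced to copying a dictionary; only the nonce locking, nonce validation, and command forwarding behavior are modeled.

```python
import hmac
import secrets


class SourceManager:
    """Toy source-side vTPM manager holding one vTPM instance."""

    def __init__(self, state: dict):
        self.state = state
        self.locked_nonce = None

    def handle_command(self, command: str) -> str:
        return f"source handled {command}"

    def lock(self, nonce: bytes) -> None:
        self.locked_nonce = nonce

    def export_state(self):
        if self.locked_nonce is None:
            raise RuntimeError("instance must be locked to a nonce before export")
        package = (dict(self.state), self.locked_nonce)
        self.state = None  # the source deletes its instance after packaging
        return package


class DestinationManager:
    """Toy destination-side vTPM manager that proxies during migration."""

    def __init__(self, source: SourceManager):
        self.source = source
        self.state = None
        self.nonce = None
        self.forwarding = False

    def begin_migration(self) -> None:
        self.forwarding = True                 # start proxying TPM commands
        self.nonce = secrets.token_bytes(16)   # fresh nonce for this migration
        self.source.lock(self.nonce)

    def handle_command(self, command: str) -> str:
        if self.forwarding:
            return self.source.handle_command(command)
        if self.state is None:
            raise RuntimeError("no vTPM instance available")
        return f"destination handled {command}"

    def finish_migration(self) -> None:
        state, nonce = self.source.export_state()
        if not hmac.compare_digest(nonce, self.nonce):
            raise ValueError("nonce mismatch, refusing to import state")
        self.state = state
        self.forwarding = False                # stop proxying, serve locally


source = SourceManager({"pcr16": "00" * 32})
destination = DestinationManager(source)
destination.begin_migration()
print(destination.handle_command("TPM2_GetRandom"))  # proxied to the source
destination.finish_migration()
print(destination.handle_command("TPM2_GetRandom"))  # served by the destination
```

The workload keeps receiving answers throughout: before the state arrives, the destination proxies to the source, and after the nonce is validated and the state imported, the destination answers directly.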
Subsequently, at block 1130, the IMC of the PNID may instantiate a virtual TPM (vTPM) manager that is linked to the dTPM. Lastly, at block 1140, the vTPM manager of the PNID may initiate one or more vTPM instances hosted by the vTPM manager in the IMC. In one implementation, the one or more vTPM instances correspond to one or more tenants hosted on at least one of the programmable network interface device or a host device communicably coupled to the programmable network interface device.
Method 1200 begins at processing block 1210 where a destination vTPM manager of a destination IMC of a destination PNID may receive a request to create a vTPM instance. In one implementation, the request is received from a destination host device communicably coupled to the destination PNID. In one implementation, the vTPM instance is to be migrated from a source vTPM manager of a source IMC at a source PNID. Then, at block 1220, the destination vTPM manager of the destination PNID may enable TPM command forwarding for the vTPM instance to a source vTPM manager of the source PNID.
Subsequently, at block 1230, the destination vTPM manager of the destination PNID may export a nonce to the source vTPM manager. In one implementation, the source vTPM manager is locked to the nonce and exports TPM state for the vTPM instance with the nonce. Then, at block 1240, the destination vTPM manager of the destination PNID may validate the nonce received from the source vTPM manager prior to importing any TPM state received from the source vTPM manager.
At block 1250, the destination vTPM manager of the destination PNID may, responsive to the transfer of the state of the vTPM from the source vTPM manager being finalized, terminate the TPM command forwarding for the vTPM instance. Lastly, at block 1260, the destination vTPM manager of the destination PNID may service TPM commands for the vTPM instance at the destination vTPM manager.
The following examples pertain to further embodiments. Example 1 is an apparatus to facilitate scalable TPM in programmable network interface devices. The apparatus of Example 1 includes a host interface; a network interface; and a programmable circuitry communicably coupled to the host interface and the network interface, the programmable circuitry comprising: one or more processors to implement network interface functionality; and a discrete trusted platform module (dTPM) to enable the one or more processors to establish a secure boot mechanism for the apparatus; wherein the one or more processors are to instantiate a virtual TPM (vTPM) manager that is associated with the dTPM, the vTPM manager to host vTPM instances corresponding to one or more virtualized environments hosted on at least one of the programmable circuitry or a host device communicably coupled to the apparatus.
In Example 2, the subject matter of Example 1 can optionally include wherein the vTPM manager comprises a system service of an operating system (OS) executed by the one or more processors. In Example 3, the subject matter of any one of Examples 1-2 can optionally include wherein the one or more virtualized environments comprise at least one of virtual machines (VMs) or containers hosted by the at least one of the programmable circuitry or the host device. In Example 4, the subject matter of any one of Examples 1-3 can optionally include wherein the vTPM manager is to generate an independent key hierarchy for respective vTPM instances hosted by the vTPM manager, and wherein the independent key hierarchy is unlinked from a key hierarchy of the dTPM, and wherein the vTPM manager is to generate an endorsement key for the respective vTPM instances created by the vTPM manager.
In Example 5, the subject matter of any one of Examples 1-4 can optionally include wherein the vTPM manager is to generate the independent key hierarchy by providing two views of platform configuration registers (PCR) registers of the programmable circuitry. In Example 6, the subject matter of any one of Examples 1-5 can optionally include wherein the two views comprise a first set of the PCR registers of a respective vTPM instance that are associated with the dTPM and a second set of the PCR registers of the respective vTPM instance that are associated with the respective vTPM instance.
In Example 7, the subject matter of any one of Examples 1-6 can optionally include wherein the one or more processors are to expose two ports to enable communication to the vTPM manager from a compute complex (CC) of the apparatus and from the host device, and wherein the two ports are isolated from one another.
In Example 8, the subject matter of any one of Examples 1-7 can optionally include wherein an integrated management complex (IMC) of the programmable circuitry is identified as a source IMC for migration of a workload of a migrating vTPM instance of the vTPM instances to a destination host device communicably coupled to a destination IMC of a destination programmable network interface device, and wherein the destination host device is to request the migration of the migrating vTPM instance to the destination IMC hosting a destination vTPM manager, and wherein the destination vTPM manager is to manage the migration of the workload and is to forward TPM commands for the workload to the source IMC as part of the migration of the workload.
In Example 9, the subject matter of any one of Examples 1-8 can optionally include wherein the destination IMC is to export a nonce to the source IMC and wherein the vTPM manager of the source IMC is locked to the nonce, a state of the migrating vTPM instance is exported with the nonce, and the nonce is validated before import. In Example 10, the subject matter of any one of Examples 1-9 can optionally include wherein responsive to transfer of the state to the destination IMC being finalized, forwarding of the TPM commands to the source IMC is stopped and the TPM commands are serviced by the destination vTPM manager.
Example 11 is a method for facilitating scalable TPM in programmable network interface devices. The method of Example 11 can include hosting, by a programmable network interface device, a discrete trusted platform module (dTPM) that enables programmable circuitry of the programmable network interface device to establish a secure boot mechanism for the programmable network interface device; instantiating, by the programmable circuitry, a virtual TPM (vTPM) manager that is linked to the dTPM; and initiating, by the vTPM manager, vTPM instances hosted by the vTPM manager, the vTPM instances corresponding to one or more tenants hosted on at least one of the programmable network interface device or a host device communicably coupled to the programmable network interface device.
In Example 12, the subject matter of Example 11 can optionally include wherein the one or more tenants comprise at least one of virtual machines (VMs) or containers hosted by the at least one of the programmable network interface device or the host device. In Example 13, the subject matter of Examples 11-12 can optionally include wherein the vTPM manager is to create an independent key hierarchy for respective vTPM instances created by the vTPM manager, and wherein the independent key hierarchy is unlinked from a key hierarchy of the dTPM, and wherein the vTPM manager is to generate an endorsement key for the respective vTPM instances created by the vTPM manager.
In Example 14, the subject matter of Examples 11-13 can optionally include wherein the programmable circuitry comprises an integrated management complex (IMC) that is to expose two ports to enable communication to the vTPM manager from a compute complex (CC) of the programmable network interface device and from the host device, and wherein the two ports are isolated from one another. In Example 15, the subject matter of Examples 11-14 can optionally include wherein the IMC is identified as a source IMC for migration of a workload of a migrating vTPM instance of the vTPM instances to a destination host device communicably coupled to a destination IMC of a destination programmable network interface device, and wherein the destination host device is to request the migration of the migrating vTPM instance to the destination IMC hosting a destination vTPM manager, and wherein the destination vTPM manager is to manage the migration of the workload and is to forward TPM commands for the workload to the source IMC as part of the migration of the workload.
In Example 16, the subject matter of Examples 11-15 can optionally include wherein the destination IMC is to export a nonce to the source IMC and wherein the vTPM manager of the source IMC is locked to the nonce, a state of the migrating vTPM instance is exported with the nonce, and the nonce is validated before import, and wherein responsive to transfer of the state to the destination IMC being finalized, forwarding of the TPM commands to the source IMC is stopped and the TPM commands are serviced by the destination vTPM manager.
Example 17 is a non-transitory computer-readable storage medium for facilitating scalable TPM in programmable network interface devices. The non-transitory computer-readable storage medium of Example 17 having instructions stored thereon, which when executed by one or more processors, cause the one or more processors to perform operations comprising: hosting, by a programmable network interface device comprising the one or more processors, a discrete trusted platform module (dTPM) that enables programmable circuitry of the programmable network interface device to establish a secure boot mechanism for the programmable network interface device; instantiating, by the programmable circuitry, a virtual TPM (vTPM) manager that is linked to the dTPM; and initiating, by the vTPM manager, vTPM instances hosted by the vTPM manager, the vTPM instances corresponding to one or more tenants hosted on at least one of the programmable network interface device or a host device communicably coupled to the programmable network interface device.
In Example 18, the subject matter of Example 17 can optionally include wherein the vTPM manager is to create an independent key hierarchy for respective vTPM instances created by the vTPM manager, and wherein the independent key hierarchy is unlinked from a key hierarchy of the dTPM, and wherein the vTPM manager is to generate an endorsement key for the respective vTPM instances created by the vTPM manager. In Example 19, the subject matter of Examples 17-18 can optionally include wherein the programmable circuitry comprises an integrated management complex (IMC) that is to expose two ports to enable communication to the vTPM manager from a compute complex (CC) of the programmable network interface device and from the host device, and wherein the two ports are isolated from one another.
In Example 20, the subject matter of Examples 17-19 can optionally include wherein the IMC is identified as a source IMC for migration of a workload of a migrating vTPM instance of the vTPM instances to a destination host device communicably coupled to a destination IMC of a destination programmable network interface device, and wherein the destination host device is to request the migration of the migrating vTPM instance to the destination IMC hosting a destination vTPM manager, and wherein the destination vTPM manager is to manage the migration of the workload and is to forward TPM commands for the workload to the source IMC as part of the migration of the workload.
Example 21 is a system for scalable TPM in programmable network interface devices. The system of Example 21 can optionally include a cluster of processing units; and a programmable network interface device communicably coupled to the cluster of processing units and comprising one or more processors to implement network interface functionality that comprises: a host interface; a network interface; and a programmable circuitry communicably coupled to the host interface and the network interface, the programmable circuitry comprising: one or more processors to implement network interface functionality; and a discrete trusted platform module (dTPM) to enable the one or more processors to establish a secure boot mechanism for the apparatus; wherein the one or more processors are to instantiate a virtual TPM (vTPM) manager that is associated with the dTPM, the vTPM manager to host vTPM instances corresponding to one or more virtualized environments hosted on at least one of the programmable circuitry or a host device communicably coupled to the apparatus.
In Example 22, the subject matter of Example 21 can optionally include wherein the vTPM manager comprises a system service of an operating system (OS) executed by the one or more processors. In Example 23, the subject matter of any one of Examples 21-22 can optionally include wherein the one or more virtualized environments comprise at least one of virtual machines (VMs) or containers hosted by the at least one of the programmable circuitry or the host device. In Example 24, the subject matter of any one of Examples 21-23 can optionally include wherein the vTPM manager is to generate an independent key hierarchy for respective vTPM instances hosted by the vTPM manager, and wherein the independent key hierarchy is unlinked from a key hierarchy of the dTPM, and wherein the vTPM manager is to generate an endorsement key for the respective vTPM instances created by the vTPM manager.
In Example 25, the subject matter of any one of Examples 21-24 can optionally include wherein the vTPM manager is to generate the independent key hierarchy by providing two views of platform configuration registers (PCR) registers of the programmable circuitry. In Example 26, the subject matter of any one of Examples 21-25 can optionally include wherein the two views comprise a first set of the PCR registers of a respective vTPM instance that are associated with the dTPM and a second set of the PCR registers of the respective vTPM instance that are associated with the respective vTPM instance.
In Example 27, the subject matter of any one of Examples 21-26 can optionally include wherein the one or more processors are to expose two ports to enable communication to the vTPM manager from a compute complex (CC) of the apparatus and from the host device, and wherein the two ports are isolated from one another.
In Example 28, the subject matter of any one of Examples 21-27 can optionally include wherein an integrated management complex (IMC) of the programmable circuitry is identified as a source IMC for migration of a workload of a migrating vTPM instance of the vTPM instances to a destination host device communicably coupled to a destination IMC of a destination programmable network interface device, and wherein the destination host device is to request the migration of the migrating vTPM instance to the destination IMC hosting a destination vTPM manager, and wherein the destination vTPM manager is to manage the migration of the workload and is to forward TPM commands for the workload to the source IMC as part of the migration of the workload.
In Example 29, the subject matter of any one of Examples 21-28 can optionally include wherein the destination IMC is to export a nonce to the source IMC and wherein the vTPM manager of the source IMC is locked to the nonce, a state of the migrating vTPM instance is exported with the nonce, and the nonce is validated before import. In Example 30, the subject matter of any one of Examples 21-29 can optionally include wherein responsive to transfer of the state to the destination IMC being finalized, forwarding of the TPM commands to the source IMC is stopped and the TPM commands are serviced by the destination vTPM manager.
Example 31 is an apparatus for facilitating scalable TPM in programmable network interface devices, comprising means for hosting, via a programmable network interface device, a discrete trusted platform module (dTPM) that enables programmable circuitry of the programmable network interface device to establish a secure boot mechanism for the programmable network interface device; means for instantiating, via the programmable circuitry, a virtual TPM (vTPM) manager that is linked to the dTPM; and means for initiating, via the vTPM manager, vTPM instances hosted by the vTPM manager, the vTPM instances corresponding to one or more tenants hosted on at least one of the programmable network interface device or a host device communicably coupled to the programmable network interface device. In Example 32, the subject matter of Example 31 can optionally include the apparatus further configured to perform the method of any one of the Examples 12 to 16.
Example 33 is at least one machine readable medium comprising a plurality of instructions that in response to being executed on a computing device, cause the computing device to carry out a method according to any one of Examples 11 to 16. Example 34 is an apparatus for facilitating scalable TPM in programmable network interface devices, configured to perform the method of any one of Examples 11 to 16. Example 35 is an apparatus for scalable TPM in programmable network interface devices, comprising means for performing the method of any one of Examples 11 to 16. Specifics in the Examples may be used anywhere in one or more embodiments.
The foregoing description and drawings are to be regarded in an illustrative rather than a restrictive sense. Persons skilled in the art will understand that various modifications and changes may be made to the embodiments described herein without departing from the broader spirit and scope of the features set forth in the appended claims.