During the past decade, there has been tremendous growth in the usage of so-called “cloud-hosted” services. Examples of such services include e-mail services provided by Microsoft (Hotmail/Outlook online), Google (Gmail) and Yahoo (Yahoo mail), productivity applications such as Microsoft Office 365 and Google Docs, and Web service platforms such as Amazon Web Services (AWS) and Elastic Compute Cloud (EC2) and Microsoft Azure. Cloud-hosted services and cloud-based architectures are also widely used for telecommunication networks and mobile services. Cloud-hosted services are typically implemented using data centers that have a very large number of compute resources, implemented in racks of various types of servers, such as blade servers filled with server blades and/or modules and other types of server configurations (e.g., 1U, 2U, and 4U servers).
Cloud-hosted services include Web services, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). Cloud Service Providers (CSPs) have implemented growing levels of virtualization in these services. For example, deployment of Software Defined Networking (SDN) and Network Function Virtualization (NFV) has also seen rapid growth in the past few years. Under SDN, the system that makes decisions about where traffic is sent (the control plane) is decoupled from the underlying system that forwards traffic to the selected destination (the data plane). SDN concepts may be employed to facilitate network virtualization, enabling service providers to manage various aspects of their network services via software applications and APIs (Application Program Interfaces). Under NFV, by virtualizing network functions as software applications (including virtual network functions (VNFs)), network service providers can gain flexibility in network configuration, enabling significant benefits including optimization of available bandwidth, cost savings, and faster time to market for new services.
In the IaaS cloud industry, virtualization plays a fundamental role. Virtual machines are popular for their elasticity. Meanwhile, physical machines remain indispensable for their high performance and comprehensive features. Under virtualization in cloud environments, very large numbers of traffic flows may exist, which poses challenges. Supporting packet processing and forwarding for such a large number of flows can be very CPU (central processing unit) intensive. One solution is to use so-called “Smart” NICs (Network Interface Controllers) in the compute servers to offload routing and forwarding aspects of packet processing to hardware in the NICs. Another approach uses accelerator cards in the compute servers. However, these approaches do not address aspects of forwarding data and storage traffic between pairs of compute servers and between compute servers and storage servers that are implemented in switches in cloud infrastructures.
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
Embodiments of methods and apparatus for smart switch centered next generation cloud infrastructure architectures are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
For clarity, individual components in the Figures herein may also be referred to by their labels in the Figures, rather than by a particular reference number. Additionally, reference numbers referring to a particular type of component (as opposed to a particular component) may be shown with a reference number followed by “(typ)” meaning “typical.” It will be understood that the configuration of these components will be typical of similar components that may exist but are not shown in the drawing Figures for simplicity and clarity or otherwise similar components that are not labeled with separate reference numbers. Conversely, “(typ)” is not to be construed as meaning the component, element, etc. is typically used for its disclosed function, implement, purpose, etc.
In accordance with aspects of the embodiments disclosed herein, smart server switches are provided that support hardware-based forwarding of data traffic and storage traffic in cloud environments employing virtualization in compute servers and storage servers. In one aspect, the hardware-based forwarding is implemented in the data plane using programmable switch chips that are used to execute data plane runtime code in hardware. In some embodiments, the switch chips are P4 (named for “Programming Protocol-independent Packet Processors”) chips.
Each of compute servers 108 and 110 includes software components comprising a management VM 120, one or more VMs 122, and one or more VNFs 124 (only one of which is shown). Each compute server 108 and 110 also includes a NIC (network interface controller) 126 including a P4 NIC chip. Each of storage servers 112 and 114 includes a plurality of storage devices depicted as disks 128 for illustrative purposes. Generally, disks 128 are illustrative of a variety of types of non-volatile storage devices including solid-state disks and magnetic disks, as well as storage devices having other form factors such as NVDIMMs (Non-volatile Dual Inline Memory Modules).
ToR switch 104 is connected to compute server 108 via a virtual local area network (VLAN) link 130 and to compute server 110 via a VLAN link 132. ToR switch 106 is connected to storage server 112 via a VLAN link 134 and to storage server 114 via a VLAN link 136. In the illustrated embodiment, ToR switches 104 and 106 are respectively connected to aggregation switch 103 via VxLAN (Virtual Extensible LAN) links 138 and 140. VxLAN is a network virtualization technology used to support scalability in large cloud computing deployments. VxLAN is a tunneling protocol that encapsulates Layer 2 Ethernet frames in Layer 4 User Datagram Protocol (UDP) datagrams (also referred to as UDP packets), enabling operators to create virtualized Layer 2 subnets, or segments, that span physical Layer 3 networks.
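The encapsulation described above can be illustrated with a minimal sketch. It follows the VXLAN header layout of RFC 7348 (an 8-byte header carrying a 24-bit network identifier, sent to UDP destination port 4789). The function names, the default source port, and the omission of the outer Ethernet and IP headers are assumptions made for illustration and are not part of the described embodiments.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned UDP destination port for VXLAN (RFC 7348)
VXLAN_FLAG_VNI_VALID = 0x08    # "I" flag: the VNI field carries a valid identifier


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits.
    # Word 2: VNI (24 bits) + 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def udp_header(src_port: int, payload_len: int) -> bytes:
    """Build an 8-byte UDP header with a zero checksum (illustrative only)."""
    return struct.pack("!HHHH", src_port, VXLAN_UDP_PORT, 8 + payload_len, 0)


def encapsulate(inner_frame: bytes, vni: int, src_port: int = 49152) -> bytes:
    """Wrap a Layer 2 frame as UDP header + VXLAN header + original frame."""
    vxlan_payload = vxlan_header(vni) + inner_frame
    return udp_header(src_port, len(vxlan_payload)) + vxlan_payload


def decapsulate(datagram: bytes) -> tuple[int, bytes]:
    """Return (vni, inner_frame) from a datagram produced by encapsulate()."""
    flags_word, vni_word = struct.unpack("!II", datagram[8:16])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8, datagram[16:]
```

In this sketch the inner Ethernet frame is carried unchanged, and the 24-bit identifier is what allows many virtual Layer 2 segments to share one physical Layer 3 network.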
In the illustrated embodiment, kernel 204 is a Linux kernel and includes a Linux KVM (Kernel-based Virtual Machine) 224. A Linux KVM is a full virtualization solution for Linux on x86 hardware containing virtualization extensions (Intel® VT or AMD®-V). It consists of a loadable kernel module, kvm.ko, that provides the core virtualization infrastructure and a processor specific module, kvm-intel.ko or kvm-amd.ko.
User space 206 is used to load and execute various software components and applications. These include one or more management VMs 226, a plurality of VMs 228, and one or more VNFs 230. User space 206 also includes additional KVM virtualization components that are implemented in user space rather than the Linux kernel, such as QEMU in some embodiments. QEMU is a generic and open-source machine emulator and virtualizer.
P4-SSCI-NIC block 212 employs a hardware programming language (e.g., P4 language), P4Runtime, and associated libraries to enable NIC chip 210 to be dynamically programmed to implement a packet processing pipeline. In one embodiment, NIC chip 210 includes circuitry to support P4 applications (e.g., applications written in the P4 language). Once programmed, P4-SSCI-NIC block 212 may support one or more of ACL (access control list) functions, firewall functions, switch functions, and/or router functions. Further details of programming with P4 and associated functionality are described below.
In one embodiment, ToR switch 302 is a “server switch,” meaning it is a switch having an underlying architecture similar to a compute server that supports switching functionality. ToR switch 302 is logically partitioned as hardware 304, an OS kernel 306, and user space 308. Hardware 304 includes one or more CPUs 310 and a P4 switch chip 312. P4 switch chip 312 includes a P4-SSCI-Switch block 314 and multiple ports 316. In the illustrated example, there are 32 ports, but this is merely exemplary as other numbers of ports may be implemented, such as 24, 28, 36, etc. P4-SSCI-Switch block 314 is programmed using P4 and may support one or more functions including ACL functions, firewall functions, switch functions, and router functions. P4-SSCI-Switch block 314 also operates as a VxLAN terminator to support VxLAN operations.
Application-level software is executed in user space 308. This includes P4 libraries/SDK 318, one or more VNFs 320, and Stratum 322. Stratum is an open source silicon-independent switch operating system for SDNs. Stratum exposes a set of next-generation SDN interfaces including P4Runtime and OpenConfig, enabling interchangeability of forwarding devices and programmability of forwarding behaviors. Stratum defines a contract defining forwarding behavior supported by the data plane, expressed in P4 language.
Architecture 300 further shows an external server 324 running OpenStack 326. The OpenStack project is a global collaboration of developers and cloud computing technologists producing an open standard cloud computing platform for both public and private clouds. OpenStack is a free open standard cloud computing platform, mostly deployed as infrastructure-as-a-service (IaaS) in both public and private clouds. Server 324 is also running Neutron 328, which includes a networking-SSCI block 330. Neutron is an OpenStack project to provide “networking as a service” between interface devices (e.g., vNICs) managed by other OpenStack services (e.g., Nova). Networking-SSCI block 330 provides communication between Neutron 328 and Stratum 322.
P4 is a language for expressing how packets are processed by the data plane of a forwarding element such as a hardware or software switch, network interface card/controller (NIC), router, or network appliance. Many targets (in particular targets following an SDN architecture) implement a separate control plane and a data plane. P4 is designed to specify the data plane functionality of the target. Separately, P4 programs can also be used along with P4Runtime to partially define the interface by which the control plane and the data-plane communicate. In this scenario, P4 is first used to describe the forwarding behavior and this in turn is converted by a P4 compiler into the metadata needed for the control plane and data plane to communicate. The data plane need not be programmable for P4 and P4Runtime to be of value in unambiguously defining the capabilities of the data plane and how the control plane can control these capabilities.
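The split between control plane and data plane described above can be modelled with a minimal sketch of a match-action table. This is a plain Python model, not P4 code and not the P4Runtime API. The class name `MatchActionTable`, the packet representation as a dictionary, and the `forward` and `drop` actions are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

Packet = Dict[str, object]
Action = Callable[[Packet], Packet]


def forward(port: int) -> Action:
    """Return an action that sends the packet out of the given port."""
    def _forward(pkt: Packet) -> Packet:
        return {**pkt, "egress_port": port}
    return _forward


def drop(pkt: Packet) -> Packet:
    """Action that marks the packet as dropped (no egress port)."""
    return {**pkt, "egress_port": None}


@dataclass
class MatchActionTable:
    """Exact-match table: the data plane looks up, the control plane installs."""
    key_field: str
    default_action: Action = drop
    entries: Dict[object, Action] = field(default_factory=dict)

    def insert(self, key: object, action: Action) -> None:
        # Control-plane operation: install or replace a table entry at runtime.
        self.entries[key] = action

    def apply(self, pkt: Packet) -> Packet:
        # Data-plane operation: match on one header field, run the bound action.
        action: Optional[Action] = self.entries.get(pkt.get(self.key_field))
        return (action or self.default_action)(pkt)
```

In this model the table structure (which field is matched, which actions exist) stands in for what a P4 program would declare, while calls to `insert` stand in for what a control plane would do at runtime through an interface such as P4Runtime.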
The control plane 402 aspects of the P4 deployment model enable software running on a server or the like to implement control plane operations using API 412. API 412 provides a means for communicating with and controlling data plane runtime code 410 running on P4 switch chip 312, wherein API 412 may leverage use of P4 libraries/SDK 318.
Under the configuration illustrated in
Generally, the primary data plane workload of ToR switch 302 and ToR switch 302a is performed in hardware via P4 data plane runtime code executing on P4 switch chip 312. The use of one or more VNFs 320 is optional. Some functions that are commonly associated with data plane aspects may be implemented in one or more VNFs. For example, this may include a VNF (or NFV) to track a customer's specific connections.
In some embodiments, P4 switch chip 312 comprises a P4 switch chip provided by Barefoot Networks®. In some embodiments, P4 switch chip 312 is a Barefoot Networks® Tofino chip that implements a Protocol Independent Switch Architecture (PISA) and can be programmed using P4. In embodiments employing Barefoot Networks® switch chips, P4 libraries/SDK and compiler 408 are provided by Barefoot Networks®.
In further detail, architecture 500 depicts multiple compute servers 502 having similar configurations coupled to a ToR switch 504 via links 503. ToR switch 504 is connected to a ToR switch 508 via an aggregation switch 506 and links 505 and 507, and is connected to multiple storage servers 510 via links 511. Alternatively, ToR switch 504 is connected to ToR switch 508 via a direct link 509. Compute server 502 includes one or more VMs 512 that are connected to a respective NVMe (Non-Volatile Memory Express) host 514 implemented in NIC hardware 516. NIC hardware 516 further includes an NVMe-oF (Non-Volatile Memory Express over Fabric) block 518 and an RDMA (Remote Direct Memory Access) block 520 that is configured to employ RDMA verbs to support remote access to data stored on storage servers 510.
In some embodiments ToR switch 504 is a server switch having switch hardware 522 similar to hardware 304. Functionality implemented in switch hardware 522 includes data path and dispatch forwarding 524. Software 526 for ToR switch 504 includes Ceph RBD (Reliable Autonomic Distributed Object Store (RADOS) Block Device) module 528 and one or more NVMe target admin queues 530. Ceph is a distributed object, block, and file storage platform that is part of the open source Ceph project. Ceph's object storage system allows users to mount Ceph as a thin-provisioned block device. When an application writes data to Ceph using a block device, Ceph automatically stripes and replicates the data across the cluster. Ceph's RBD also integrates with Kernel-based Virtual Machines (KVMs).
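The striping behavior mentioned above can be sketched as a mapping from a byte range on the block device to fixed-size backing objects. This is a simplified illustration only: the 4 MiB object size is an assumed value, the function name `split_write` is hypothetical, and replica placement across the cluster is not modelled.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size in bytes, for illustration


def split_write(offset: int, length: int, object_size: int = OBJECT_SIZE):
    """Split a block-device write into per-object extents.

    Returns a list of (object_index, offset_in_object, chunk_length) tuples.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    extents = []
    end = offset + length
    while offset < end:
        # Which object does this byte fall in, and where inside that object?
        index, within = divmod(offset, object_size)
        # Write up to the end of the current object or the end of the request.
        chunk = min(object_size - within, end - offset)
        extents.append((index, within, chunk))
        offset += chunk
    return extents
```

A write that crosses an object boundary is thus split into separate extents, each of which could be stored and replicated independently.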
In some embodiments ToR switch 508 is a server switch having switch hardware 532 similar to hardware 304. Functionality implemented in switch hardware 532 includes data path ACL and forwarding 534. Software 536 for ToR switch 508 includes Ceph Object Storage Daemon (OSD) 538 and one or more NVMe host admin queues 540. Ceph OSD 538 is the object storage daemon for the Ceph distributed file system. It is responsible for storing objects on a local file system and providing access to them over the network.
Storage server 510 includes a plurality of disks 542 that are connected to respective NVMe targets 544 implemented in NIC hardware 546. NIC hardware 546 further includes a distributed replication block 548, an NVMe-oF block 550, and an RDMA block 552 that is configured to employ RDMA verbs to support host-side access to data stored in disks 542 in connection with RDMA block 520 on compute servers. Generally, disks 542 represent some form of storage device, which may have a physical disk form factor, such as an SSD (solid-state disk), magnetic disk, or optical disk, or may comprise another form of non-volatile storage, such as a storage class memory (SCM) device including NVDIMMs (Non-Volatile Dual Inline Memory Modules) as well as other NVM devices.
In addition to the Ceph RBD module 528 and Ceph OSD module 538, other Ceph components may be implemented that are not shown in
Under Architecture 500, the end-to-end data plane forwarding and routing is offloaded to hardware (NVMe-oF hardware and P4 switch hardware), while leveraging aspects of the Ceph distributed file system that support exabyte-level scalability and data resiliency. Moreover, disks 542, which are accessed over links 503, 505, 507, and 509 using RDMA verbs and the NVMe-oF protocol, appear to VMs 512 on compute servers 502 as if they are local disks.
Reference design 600 includes a compute server 602, a ToR switch 604, and a server 606. Compute server 602 includes a user space 608, an OS kernel 610, and a hardware NIC 612. Software components in user space 608 include QEMU 614 and a customer connection tracking NFV 616. QEMU 614 hosts a VM 618 including an application 620 running in user space 622 and a netdev component 624 and an avf driver 625 that are part of kernel 626. QEMU 614 further includes a VFIO to PCIe (virtual function input-output to Peripheral Component Interconnect Express) interface 628 and an LM module 629.
An Adaptive Virtual Function (AVF) mdev (mediated device) kernel module 630 is implemented in kernel 610. AVF mdev kernel module 630 includes a parent device 632 and an mdev instance 634. Parent device 632 includes a VF configuration manager 636, while mdev instance 634 includes an NMAP 638 and supports dirty page tracking 639.
HW NIC 612 is illustrative of a smart NIC that includes a physical function (PF) 640, a first virtual function (VF1) 642, a hardware switch 644, and a port 646. Port 646 is connected to Port 1 on ToR switch 604 via VLAN 132.
ToR switch 604 is generally configured in a similar manner to ToR switch 304 in
Network and NFV reference design 600 supports hardware-based forwarding operations during live migration. Under compute server 602, a “slow” path is used internally during live migration that employs dirty page tracking 639 to track memory pages that are dirtied during the live migration. However, the path between compute server 602 and the destination server to be migrated to (not shown), which will include one or more server switches, employs fast-path forwarding in hardware using P4 switch chip hardware in the data plane.
Reference design 700 includes a compute server 702, a ToR switch 704, and a server 706. Compute server 702 includes a user space 708, an OS kernel 710, and a hardware NIC 712. Software components in user space 708 include QEMU 714, which hosts a VM 716 including an application 718 running in user space 720 and an NVMe driver 722 that is part of kernel 724. QEMU 714 further includes a VFIO to PCIe interface 726 and an LM module 728.
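The role of dirty page tracking during live migration can be illustrated with a minimal simulation. The sketch assumes a generic iterative pre-copy scheme, which the description above does not specify; the function name `live_migrate`, the `stop_threshold` parameter, and the representation of guest memory as a dictionary are hypothetical.

```python
def live_migrate(source, guest_writes, stop_threshold=1):
    """Simulate pre-copy migration of guest memory with dirty page tracking.

    source:        dict mapping page number -> contents on the source server
    guest_writes:  list of dicts; element i holds the pages the guest writes
                   while copy round i is in progress
    Returns (destination_memory, number_of_pre_copy_rounds).
    """
    dest = {}
    dirty = set(source)            # round 0: every page must be copied
    writes = iter(guest_writes)
    rounds = 0
    while len(dirty) > stop_threshold:
        for page in dirty:         # copy while the guest keeps running
            dest[page] = source[page]
        dirty = set()
        for page, value in next(writes, {}).items():
            source[page] = value   # a guest write lands on the source...
            dirty.add(page)        # ...and the tracker records the page
        rounds += 1
    # Stop-and-copy: the guest is paused, so no further pages can be dirtied.
    for page in dirty:
        dest[page] = source[page]
    return dest, rounds
```

Each round only re-sends the pages recorded as dirty in the previous round, so the set to transfer shrinks until the guest can be paused briefly for the final copy.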
Kernel 710 includes an NVMe-oF mdev instance 730, an NVMe-oF block 732 and an RDMA block 734. HW NIC 712 is illustrative of a smart NIC that includes a physical function (PF) 736, a first virtual function (VF1) 642, a hardware switch 644, and a port 646. Port 646 is connected to Port 1 on ToR switch 604 via VLAN 132.
P4 switch 704 includes a P4-SSCI block 740 and an SPDK-SSCI block 742 that implements NVMe-oF forwarding and management operations. Server 706 includes OpenStack 744, Cinder 755, and a storage-SSCI block 746. P4-SSCI block 740 is also depicted as being virtually connected to NVMe-oF disks 748 and 750, which are representative of any type of block storage device.
Cinder is a Block Storage service for OpenStack. It is designed to present storage resources to end users that can be consumed by the OpenStack Compute Project (Nova). This is done through use of either a reference implementation (LVM) or plugin drivers for other storage. Cinder virtualizes the management of block storage devices and provides end users with a self-service API to request and consume those resources without requiring any knowledge of where their storage is actually deployed or on what type of device.
Another aspect of the architectures and reference designs described and illustrated herein is support for multi-tenant cloud environments. Under such environments, multiple tenants that lease infrastructure from CSPs and the like are allocated resources that may be shared, such as compute and storage resources. Another shared resource is the ToR switches and/or other server switches. Under virtualized network architectures, different tenants are allocated separate virtualized resources comprising physical resources that may be shared. However, for security and performance reasons (among others), various mechanisms are implemented to ensure that a given tenant's data and virtual resources are isolated and protected from other tenants in multi-tenant cloud environments.
The support for the multi-tenant cloud environment is provided in ToR switches 104a and 106a. As shown, the P4 hardware-based resources and the software-based VNFs and control plane resources are partitioned into multiple “slices,” with a given slice allocated for a respective tenant. The P4 hardware-based slices are depicted as P4 hardware network slices (P4 HW NS) 142 and software-based slices are depicted as software virtual network slices (SW VNS) 144.
In a manner similar to that described in the foregoing embodiments, P4 HW NS 142 are used to implement fast-path hardware-based forwarding. SW VNS 144 are used to implement control plane operations including control path and exception path operations such as connection tracking and ACL. From the perspective of the P4 data plane runtime code, the operation of a server switch is similar whether it is being used for a single tenant or for multiple tenants. However, the ACL and other forwarding table information will be partitioned to separate the traffic flows for individual tenants. The ACL and forwarding table information is managed by the SW VNS 144 for the tenant.
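The partitioning of ACL and forwarding table information by tenant can be sketched as tables keyed by a tenant identifier. This is an illustrative model only: the class `TenantForwardingTables`, its methods, and the use of MAC addresses as match keys are assumptions, and how a tenant identifier is derived from a packet (for example, from a VxLAN network identifier) is not specified by the description above.

```python
from collections import defaultdict
from typing import Optional


class TenantForwardingTables:
    """Per-tenant ACL and forwarding state sharing one lookup path."""

    def __init__(self) -> None:
        self._routes = defaultdict(dict)   # tenant_id -> {dst_mac: egress_port}
        self._denied = defaultdict(set)    # tenant_id -> {src_mac, ...}

    def add_route(self, tenant_id: int, dst_mac: str, port: int) -> None:
        # Installed by the software slice that manages this tenant.
        self._routes[tenant_id][dst_mac] = port

    def deny(self, tenant_id: int, src_mac: str) -> None:
        # ACL entry that applies only within the given tenant's slice.
        self._denied[tenant_id].add(src_mac)

    def lookup(self, tenant_id: int, src_mac: str, dst_mac: str) -> Optional[int]:
        """Return the egress port, or None if the packet is dropped."""
        if src_mac in self._denied.get(tenant_id, ()):
            return None
        return self._routes.get(tenant_id, {}).get(dst_mac)
```

Because every lookup is scoped by the tenant identifier, the same lookup logic serves one tenant or many, while entries installed for one tenant never match traffic belonging to another.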
As shown in an architecture 100b in
Although some embodiments have been described in reference to particular implementations, other implementations are possible according to some embodiments. Additionally, the arrangement and/or order of elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
In the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. Additionally, “communicatively coupled” means that two or more elements that may or may not be in direct contact with each other, are enabled to communicate with each other. For example, if component A is connected to component B, which in turn is connected to component C, component A may be communicatively coupled to component C using component B as an intermediary component.
An embodiment is an implementation or example of the inventions. Reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the inventions. The various appearances of “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
As discussed above, various aspects of the embodiments herein may be facilitated by corresponding software and/or firmware components and applications, such as software and/or firmware executed by an embedded processor or the like. Thus, embodiments of this invention may be used as or to support a software program, software modules, firmware, and/or distributed software executed upon some form of processor, processing core, or embedded logic, a virtual machine running on a processor or core, or otherwise implemented or realized upon or within a non-transitory computer-readable or machine-readable storage medium. A non-transitory computer-readable or machine-readable storage medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a non-transitory computer-readable or machine-readable storage medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a computer or computing machine (e.g., computing device, electronic system, etc.), such as recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). The content may be directly executable (“object” or “executable” form), source code, or difference code (“delta” or “patch” code). A non-transitory computer-readable or machine-readable storage medium may also include a storage or database from which content can be downloaded. The non-transitory computer-readable or machine-readable storage medium may also include a device or product having content stored thereon at a time of sale or delivery. Thus, delivering a device with stored content, or offering content for download over a communication medium, may be understood as providing an article of manufacture comprising a non-transitory computer-readable or machine-readable storage medium with such content described herein.
Various components referred to above as processes, servers, or tools described herein may be a means for performing the functions described. The operations and functions performed by various components described herein may be implemented by software running on a processing element, via embedded hardware or the like, or any combination of hardware and software. Such components may be implemented as software modules, hardware modules, special-purpose hardware (e.g., application specific hardware, ASICs, DSPs, etc.), embedded controllers, hardwired circuitry, hardware logic, etc. Software content (e.g., data, instructions, configuration information, etc.) may be provided via an article of manufacture including non-transitory computer-readable or machine-readable storage medium, which provides content that represents instructions that can be executed. The content may result in a computer performing various functions/operations described herein.
As used herein, a list of items joined by the term “at least one of” can mean any combination of the listed terms. For example, the phrase “at least one of A, B or C” can mean A; B; C; A and B; A and C; B and C; or A, B and C.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.