Network interface cards (NICs) can implement transmission control protocol (TCP) functions in hardware using an approach generally referred to as hardware offload. Typical hardware offload functions include checksum offload and TCP segmentation offload (TSO). The checksum function is needed to ensure that TCP packets that are corrupted during transmission are discarded instead of being delivered to an application. The checksum function can address a data payload, TCP header and parts of an internet protocol (IP) header including source and destination IP addresses, packet length and protocol type.
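As an illustration of the fields the checksum covers, the following is a minimal software sketch of the standard TCP checksum over an IPv4 pseudo-header (source and destination IP addresses, protocol type and TCP length), the TCP header and the data payload. The function names `ones_complement_sum` and `tcp_checksum` are hypothetical and are not part of the present disclosure. A NIC with checksum offload would perform the equivalent computation in hardware.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Return the 16-bit one's-complement sum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return total


def tcp_checksum(src_ip: str, dst_ip: str, tcp_segment: bytes) -> int:
    """Checksum over the pseudo-header, TCP header and payload.

    To generate a checksum, the checksum field of tcp_segment (bytes 16-17)
    must be zero. Run over a segment that already carries a correct
    checksum, the result is 0.
    """
    pseudo_header = struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),   # source IP address
        socket.inet_aton(dst_ip),   # destination IP address
        0,                          # reserved zero byte
        socket.IPPROTO_TCP,         # protocol type
        len(tcp_segment),           # TCP length: header plus payload
    )
    return ~ones_complement_sum(pseudo_header + tcp_segment) & 0xFFFF
```

On the receive side, a segment would be accepted when `tcp_checksum` over the received bytes (with the checksum field in place) evaluates to 0, and discarded otherwise.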
When an application on a source host sends a large amount of data to a destination host over a TCP connection, that data can be larger than a maximum size supported by an underlying network protocol layer. Typical Ethernet networks support maximum transmission units (MTUs) of 1500 bytes, while other link protocols can have different MTU values. When an application sends data larger than the supported MTU, the TCP layer segments that data into smaller MTU sized data packets. This segmentation and use of smaller MTU sized data packets across a software network stack in a host operating system can consume considerable central processing unit (CPU) overhead. On high bandwidth networks, TSO is a technique that can be used to reduce the CPU overhead of TCP. For TSO, instead of segmenting data in software, large chunks of data are transferred to a NIC for segmentation in hardware.
On a virtualized host that includes one or more virtual machines (VMs), TSO and other functions provided by the NIC can also be beneficial. To provide address space virtualization in large cloud datacenters, the hypervisor of the virtualized host can be used to encapsulate data packets sent by VMs with layer-2 and other upper layer headers. However, if the data packets are not properly encapsulated, benefits of hardware offload techniques that are available on NICs can be negated.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent, however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
In a virtualized datacenter, data packets sent by a source virtual machine (VM) can be encapsulated by a hypervisor of a virtualized source that hosts the source VM with new headers for implementing address space virtualization. However, if the data packets are not properly encapsulated, benefits of hardware offload techniques that are available on network interface cards (NICs) can be negated. For example, NICs can perform transmission control protocol (TCP) segmentation offload (TSO), transmit checksumming, and receive checksumming. These NIC functions can reduce central processing unit (CPU) overhead that would be otherwise used by the virtualized source for transmitting data packets to a destination virtual machine hosted on a virtualized destination.
For TSO, instead of segmenting data in software, large chunks of data are transferred to a NIC along with needed header information. The NIC segments the data into maximum transmission unit (MTU) sized data packets (i.e., data packet segments) with the relevant header information. The NIC further creates the headers such that each data packet segment is a valid TCP packet that includes a sequence number. For the virtualized destination, each MTU sized data packet segment can be forwarded to the software TCP stack and then to the application independently without the need to recreate the original larger size data packet.
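The segmentation step can be sketched in software as follows. This is only an illustrative model of what a TSO-capable NIC does in hardware. The names `Segment` and `segment_payload` are hypothetical, and the per-segment payload limit of 1460 bytes used in the usage note assumes a 1500-byte MTU with a 20-byte IP header and a 20-byte TCP header without options.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    seq: int         # TCP sequence number of the first payload byte
    payload: bytes   # at most max_payload bytes of the original data


def segment_payload(payload: bytes, initial_seq: int, max_payload: int) -> List[Segment]:
    """Split payload into chunks, giving each chunk its own sequence number."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    segments = []
    for offset in range(0, len(payload), max_payload):
        chunk = payload[offset:offset + max_payload]
        # TCP sequence numbers are 32-bit and wrap around.
        seq = (initial_seq + offset) % 2**32
        segments.append(Segment(seq=seq, payload=chunk))
    return segments
```

For instance, `segment_payload(data, 1000, 1460)` applied to 4000 bytes of data would yield three segments carrying 1460, 1460 and 1080 bytes. Because each segment carries its own sequence number, the receiver can pass each one to the TCP stack without first rebuilding the original large packet.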
In addition to TSO, NICs can support checksum offload. On the transmit side, the checksum of the TCP data packet segment is computed and added to the TCP header before transmission. On the receive side, the checksum of the data packet segment is recomputed and compared with the checksum value in the data packet segment header to ensure integrity. If checksumming is not done by the NICs, checksum computation can incur significant CPU overhead at both the transmitting and the receiving sides since every byte of the TCP segment is read for checksum computation.
As discussed above, in a virtualized datacenter, data packets sent by a source VM can be encapsulated by a hypervisor of a virtualized source with new headers for implementing address space virtualization. If the new headers that are added to the data packet are not constructed correctly, TSO and checksum offload support on NICs cannot be leveraged. Instead, the hypervisor of the virtualized source will need to break the source VM data packet into data packet segments before encapsulation, and will also need to compute the data packet segment checksum at the transmitting side. Similar operations will need to be performed by the hypervisor of the virtualized destination. These operations can consume a significant number of CPU cycles, which can reduce the efficiency of data packet transmission.
According to an example, a virtual machine data packet encapsulation and decapsulation apparatus and method are described. The method generally includes receiving a data packet including a media access control (MAC) header and an Internet protocol (IP) header, and encapsulating, by a processor, the received data packet to include an encapsulating MAC header, an encapsulating IP header, a VM MAC header with the same content as the MAC header of the received data packet, and a VM IP header with the same content as the IP header of the received data packet. The method further includes placing the VM MAC header and the VM IP header after the encapsulating MAC header and the encapsulating IP header. The received data packet may be encapsulated to include a TCP header with the same content as a TCP header of the received data packet, and the VM MAC header and the VM IP header may be placed in a TCP options field of the TCP header of the encapsulated data packet. Alternatively, the VM MAC header and the VM IP header may be placed in an IP options field of the encapsulating IP header. The VM MAC header and the VM IP header may be included in data packet segments processed by a NIC, for example, in headers of the data packet segments. The encapsulated data packet may be transmitted to a NIC of a virtualized destination, for example, as processed data packet segments.
In one example, the encapsulated data packet is transmitted to a NIC of a virtualized source, processed data packet segments are transmitted from the NIC of the virtualized source to a NIC of a virtualized destination, and a state-less decapsulation layer in a hypervisor of the virtualized destination is used to receive the processed data packet segments from the NIC of the virtualized destination, and to further process and transmit the processed data packet segments to a destination VM on the virtualized destination (in this case, the NIC of the virtualized destination does not combine the processed data packet segments to form the encapsulated data packet). Alternatively, the encapsulated data packet may be transmitted to a NIC of a virtualized source, the encapsulated data packet may be received from a NIC of a virtualized destination, and the encapsulated data packet may be decapsulated to replace the encapsulating MAC header with the VM MAC header, and the encapsulating IP header with the VM IP header (in this case, the NIC of the virtualized destination combines the processed data packet segments to form the encapsulated data packet). The method also includes transmitting the encapsulated data packet to a NIC of a virtualized source, transmitting a processed data packet segment from the NIC of the virtualized source to a NIC of a virtualized destination, receiving the processed data packet segment from the NIC of the virtualized destination, and transmitting the data packet segment to a destination VM on the virtualized destination independently of other processed data packet segments. The NICs for the virtualized source and destination may be used to perform hardware offload operations on the encapsulated data packet, including TSO and checksumming on the transmit side and checksumming on the receive side.
For the virtual machine data packet encapsulation and decapsulation apparatus and method, the encapsulation technique provides for leveraging of TSO and checksum offload support on NICs. Even if a data packet segment is lost, other data packet segments can be processed and delivered by a virtualized destination. For example, any individual data packet segment received at a virtualized destination can be delivered to the destination VM of the virtualized destination, irrespective of which other data packet segments are received. In other words, the decapsulation layer in the hypervisor, or the decapsulation module of the virtualized destination, can be state-less, in that a state of the sequence of received data packet segments does not need to be maintained. The decapsulation module may also be provided in the hypervisor of the virtualized destination. For example, no state needs to be maintained in the kernel of the virtualized destination when receiving the data packet segments. These features can reduce CPU usage at both the virtualized source and the virtualized destination.
The modules 101 and 113, and other components of the apparatus 100 that perform various other functions in the apparatus 100, may comprise machine readable instructions stored on a computer readable medium. In addition, or alternatively, the modules 101 and 113, and other components of the apparatus 100 may comprise hardware or a combination of machine readable instructions and hardware.
Referring to
The encapsulation module 101 encapsulates the data packet 102 to include an encapsulating MAC header 130, and an encapsulating IP header 131. The encapsulation module 101 may further encapsulate the data packet 102 to include a TCP header 132 with the same (or similar) content as the TCP header 122 of the data packet 102, after the encapsulating MAC header 130 and the encapsulating IP header 131. In this manner, the NIC 109 performs operations on the encapsulated data packet 108 as if the encapsulated data packet 108 is a non-encapsulated TCP packet as opposed to an encapsulated data packet. For example, once the NIC 109 receives the encapsulated data packet 108, the NIC 109 can perform TSO and checksumming operations, and forward the processed data packet segments 110 to the NIC 111 of the virtualized destination 107.
The encapsulation module 101 further encapsulates the data packet 102 to include VM MAC and VM IP headers with the encapsulating MAC header 130 and encapsulating IP header 131. As a first option, which is hereinafter denoted a TCP option, the encapsulation module 101 encapsulates the data packet 102 to include a VM MAC header 133 with the same (or similar) content as the MAC header 120 and a VM IP header 134 with the same (or similar) content as the IP header 121 in a TCP options field 135 of the TCP header 132. Specifically, the TCP header 132 includes a variable-bit field denoted TCP options where up to 40 bytes can be carried in a TCP packet. As per the standard defining the TCP protocol, the options field 135 includes a one byte key field “F” at 136 to describe the type of option carried in the TCP options field 135. The VM data packet encapsulation apparatus 100 uses a new option type for the field “F” at 136 to identify the VM MAC and IP headers. Further, a one byte size field “S” at 137 is used to describe the option size (i.e., the size of the VM MAC header 133 and the VM IP header 134), also as defined by the TCP standard. The encapsulated data packet 108 further includes data 138, which is the same as the data 123. Since the VM MAC header 133 and the VM IP header 134 generally span 34 bytes or 36 bytes if the MAC header has a VLAN tag field, these headers can fit into the 40 byte TCP options field 135. The NIC 109, which supports TSO and checksumming operations even for data packets including TCP options, thus processes the encapsulated data packet 108 as if the encapsulated data packet 108 is a non-encapsulated TCP packet.
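The layout of the TCP option described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `build_vm_header_option` is hypothetical, the option type value 253 is assumed here only because it is one of the values reserved for experimental TCP options, and the size byte is assumed to follow the usual TCP convention of counting the type and size bytes together with the option data.

```python
import struct

VM_HEADER_OPTION_KIND = 253   # assumption: experimental option type, illustrative only
MAX_TCP_OPTIONS_LEN = 40      # TCP options can carry up to 40 bytes


def build_vm_header_option(vm_mac_header: bytes, vm_ip_header: bytes) -> bytes:
    """Pack the VM MAC and VM IP headers into a single TCP option.

    Layout: type byte "F", size byte "S", VM MAC header, VM IP header.
    """
    body = vm_mac_header + vm_ip_header
    length = 2 + len(body)  # type byte + size byte + option data
    if length > MAX_TCP_OPTIONS_LEN:
        raise ValueError("VM headers do not fit in the TCP options field")
    option = struct.pack("!BB", VM_HEADER_OPTION_KIND, length) + body
    # The TCP header length is expressed in 32-bit words, so pad the
    # options to a 4-byte boundary with end-of-option-list (zero) bytes.
    padding = (-len(option)) % 4
    return option + b"\x00" * padding
```

With a 14-byte MAC header and a 20-byte IP header, the option data spans 34 bytes and the complete option occupies 36 bytes, which fits within the 40-byte limit.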
The NIC 109 segments the encapsulated data packet 108 into MTU sized data packets (i.e., data packet segments) with the relevant header information. The NIC further creates headers such that each of the processed data packet segments 110 is a valid TCP packet that includes a sequence number, and includes a TCP options field. For the virtualized destination 107, since each MTU sized data packet segment includes the VM MAC header 133 and the VM IP header 134 in the TCP options field 135, each data packet segment can be forwarded to the TCP stack independently without the need to recreate the original larger size data packet 102. For example, since the receiving hypervisor 115 can determine the address of the destination VM 106 from the VM MAC header 133 and the VM IP header 134, each data packet segment can be forwarded to the destination VM 106 without using information from previous packets (in this case, the NIC 111 of the virtualized destination 107 does not combine the processed data packet segments 110 to form the encapsulated data packet 112). During processing of the encapsulated data packet 108, the NIC 109 also computes the transmit checksumming. When the processed data packet segments 110 are received by the virtualized destination 107, the receiving NIC 111 computes the checksum and compares it with the checksum in the processed data packet segments 110 to verify integrity.
Referring to
Since not all fields in an encapsulating IP header may be needed for address virtualization (e.g., IP identification (ID)), some of the information from the VM IP header may be encoded into the encapsulating (outer) IP header. This reduces the space used for the TCP options field 135 and the IP options field 145 to less than 34 bytes.
The decapsulation module 113 may decapsulate the encapsulated data packet 112 from the NIC 111, and transmit a decapsulated data packet 114 to the hypervisor 115. The decapsulation module 113 may be state-less, in that a state of the sequence of received data packet segments does not need to be maintained. For example, if the NIC 111 does not combine the processed data packet segments 110 to form the encapsulated data packet 112, the decapsulation module 113 may nevertheless receive one or more processed data packet segments 110, decapsulate the received data packet segments 110, and forward the decapsulated data packet segments to the hypervisor 115 for forwarding to the destination VM 106. The decapsulation module 113 may decapsulate the encapsulated data packet 112 (or the received data packet segments 110) from the NIC 111 by removing the encapsulating MAC header 130 and the encapsulating IP header 131, and inserting the VM MAC header 133 and VM IP header 134 in the respective locations of the encapsulating MAC header 130 and the encapsulating IP header 131.
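The state-less decapsulation described above can be modeled as a pure function of a single segment. The sketch below is a simplified illustration under stated assumptions: the type names `EncapsulatedSegment` and `VmSegment`, the function names, the option type value 253 and the 14-byte untagged MAC header length are all hypothetical choices, and the adjustments a real implementation would make to length and checksum fields after removing the option are omitted.

```python
from dataclasses import dataclass
from typing import Tuple

VM_HEADER_OPTION_KIND = 253   # assumption: illustrative option type value
VM_MAC_HEADER_LEN = 14        # assumption: MAC header without a VLAN tag


@dataclass(frozen=True)
class EncapsulatedSegment:
    encap_mac: bytes     # encapsulating MAC header
    encap_ip: bytes      # encapsulating IP header
    tcp_fixed: bytes     # fixed part of the TCP header
    tcp_options: bytes   # TCP options carrying the VM headers
    data: bytes


@dataclass(frozen=True)
class VmSegment:
    vm_mac: bytes
    vm_ip: bytes
    tcp_fixed: bytes
    data: bytes


def extract_vm_headers(options: bytes) -> Tuple[bytes, bytes]:
    """Walk the TCP options and return (VM MAC header, VM IP header)."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:        # end of option list
            break
        if kind == 1:        # no-operation, single byte
            i += 1
            continue
        if i + 1 >= len(options):
            raise ValueError("truncated option")
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            raise ValueError("malformed option length")
        if kind == VM_HEADER_OPTION_KIND:
            body = options[i + 2:i + length]
            return body[:VM_MAC_HEADER_LEN], body[VM_MAC_HEADER_LEN:]
        i += length
    raise ValueError("VM header option not found")


def decapsulate(segment: EncapsulatedSegment) -> VmSegment:
    """Replace the encapsulating headers with the VM headers.

    Only the segment passed in is consulted; no per-connection state is kept.
    """
    vm_mac, vm_ip = extract_vm_headers(segment.tcp_options)
    return VmSegment(vm_mac, vm_ip, segment.tcp_fixed, segment.data)
```

Because `decapsulate` depends only on its argument, segments can be processed in any order, and a lost segment does not prevent the remaining segments from being decapsulated and forwarded.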
Based on the foregoing, any data packet segment from the processed data packet segments 110 received at the virtualized destination 107 can be delivered to the destination VM 106, irrespective of any other data packet segments received or lost. As a consequence, the decapsulation module 113 of the virtualized destination 107 can be stateless, in that a state of the sequence of received data packet segments does not need to be maintained. Further, for a system that uses either the VM MAC header 133 (or 143) or the VM IP header 134 (or 144) for data packet transmission, the appropriate VM header may be included in the encapsulated data packet 108. The VM MAC header 133 (or 143) and the VM IP header 134 (or 144) may also be disposed at other locations of the encapsulated data packet 108 based on space availability.
Referring to
At block 202, the received data packet is encapsulated to include an encapsulating MAC header, an encapsulating IP header, a VM MAC header with the same (or similar) content as the MAC header of the received data packet, and a VM IP header with the same (or similar) content as the IP header of the received data packet. In some cases, the encapsulating IP header may be omitted. For example, referring to
At block 203, the VM MAC header and the VM IP header are placed after the encapsulating MAC header and the encapsulating IP header. For example, referring to
At block 204, the received data packet is encapsulated to include a TCP header with the same (or similar) content as a TCP header of the received data packet, and the VM MAC header and the VM IP header are placed in a TCP options field of the TCP header of the encapsulated data packet. For example, referring to
At block 205, alternatively, the received data packet is encapsulated to include a TCP header with the same (or similar) content as a TCP header of the received data packet, and the VM MAC header and the VM IP header are placed in an IP options field of the encapsulating IP header. For example, referring to
At block 206, the VM MAC header and the VM IP header are included in data packet segments processed by the NIC. For example, referring to
Referring to
At block 302, the VM MAC and VM IP headers are retrieved from the encapsulated data packet. For example, referring to
At block 303, the encapsulated data packet is decapsulated. For example, referring to
At block 304, the decapsulated data packet is forwarded to a destination VM. For example, referring to
The computer system includes a processor 402 that may implement or execute machine readable instructions performing some or all of the methods, functions and other processes described herein. Commands and data from the processor 402 are communicated over a communication bus 404. The computer system also includes a main memory 406, such as a random access memory (RAM), where the machine readable instructions and data for the processor 402 may reside during runtime, and a secondary data storage 408, which may be non-volatile and stores machine readable instructions and data. The memory and data storage are examples of computer readable mediums. The memory 406 may include modules 420 including machine readable instructions residing in the memory 406 during runtime and executed by the processor 402. The modules 420 may include the modules 101 and 113 of the apparatus shown in
The computer system may include an I/O device 410, such as a keyboard, a mouse, a display, etc. The computer system may include a network interface 412 for connecting to a network. Other known electronic components may be added or substituted in the computer system.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2012/048959 | 7/31/2012 | WO | 00 | 11/11/2014 |