In some cases, network stack functionality is implemented in software on a host device. For instance, various operating systems provide an Internet protocol suite that provides transport layer functionality via Transmission Control Protocol (“TCP”) and network layer functionality via Internet Protocol (“IP”). When network stack functionality is provided in software, engineers or administrators can test the network stack using configuration settings or by changing source code. However, recent trends involve implementing network stack functionality in hardware. From the perspective of an engineer or administrator, a hardware network stack implementation from a third-party vendor is a “black box” that can be difficult to test and/or analyze.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The description generally relates to analysis of network stack functionality when implemented in hardware. One example includes a method or technique that can be performed on a computing device. The method or technique includes configuring a programmable network device to inject events into network traffic communicated between two or more hosts having network adapters that perform network stack functionality in hardware. The method or technique also includes obtaining mirrored traffic provided by the programmable network device, where the mirrored traffic includes the injected events. The method or technique also includes analyzing the mirrored traffic to obtain analysis results. The analysis results reflect behavior of the network stack functionality in response to the injected events. The method or technique also includes outputting the analysis results.
Another example includes a programmable network device that includes a plurality of ports and a programmable logic circuit. The programmable logic circuit is configured to receive event parameters of one or more events to inject into network traffic communicated between two or more network adapters connected to two or more of the ports. The programmable logic circuit is also configured to inject the one or more events into the network traffic and communicate the network traffic having the one or more injected events between the two or more network adapters. The programmable logic circuit is also configured to mirror the network traffic to one or more other devices for subsequent analysis.
Another example includes a system having a hardware processing unit and a storage resource storing computer-readable instructions. When executed by the hardware processing unit, the computer-readable instructions cause the system to configure a programmable network device to inject events into network traffic communicated between two or more hosts having network adapters that perform network stack functionality in hardware. The computer-readable instructions also cause the system to obtain mirrored traffic provided by the programmable network device, where the mirrored traffic includes the injected events. The computer-readable instructions also cause the system to analyze the mirrored traffic to obtain analysis results. The analysis results reflect behavior of the network stack functionality in response to the injected events. The computer-readable instructions also cause the system to output the analysis results.
The Detailed Description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of similar reference numbers in different instances in the description and the figures may indicate similar or identical items.
As discussed above, computer networking generally relies on hardware or software network stacks that implement one or more layers of a networking model. For instance, an operating system such as Microsoft Windows can provide TCP/IP functionality by implementing one or more layers of the Open Systems Interconnection (“OSI”) model in software, such as transport, network, and/or data link layers. Software implementations of network stacks can be analyzed and debugged using software tools that evaluate the specific behaviors of each layer of the stack.
However, software network stack implementations tend to be relatively slow, as they consume computing resources such as CPU and memory. One alternative to implementing network stack functionality in software involves offloading network stack functionality to hardware, such as network adapters. Hardware stack implementations are particularly suited for applications where high throughput, low latency, and/or low CPU overhead are desirable, such as in modern data centers.
To effectively utilize the performance of hardware network stacks, it is useful for network engineers to have an in-depth understanding of how the hardware network stack behaves. However, it is difficult to test hardware network stacks because the operating system kernel does not have direct control over the hardware. As a consequence, conventional tools for testing of hardware network stacks tend to be inflexible and do not allow for analysis of fine-grained device behavior.
The disclosed implementations offer techniques for evaluating and comparing hardware network stack behavior on different devices, such as network adapters. As discussed more below, programmable network devices, such as switches, are used to emulate various network scenarios involving hardware network stacks. A programmable network device can inject events into network traffic communicated between two or more hosts, and mirror the traffic with the events to one or more traffic pooling servers. The use of a programmable network device as described herein allows for deterministic event injection to be configured using user-friendly interfaces that allow engineers to write precise, reproducible tests without necessarily modifying the underlying hardware or directly analyzing the hardware logic of the network devices being tested.
For the purposes of this document, the term “programmable network device” means a network device that can be programmed using instructions to implement specific behaviors, such as parsing of packets and match-action behavior. One example of a programmable network device is an Intel® Tofino™ programmable switch that can be programmed using the P4 programming language. A programmable network device can forward traffic among one or more ports to connected devices. For instance, a programmable network device can use Media Access Control (“MAC”) or other hardware addresses of destination devices to determine on which port to forward received network packets.
In some cases, a programmable network device does not merely forward traffic, but can also inject events into the traffic. An “event” is a modification to traffic received by the programmable network device that is inserted by the programmable network device before forwarding the traffic to its destination address. A programmable network device can “mirror” traffic by sending copies of the traffic to another device, such as a traffic pooling server.
The term “host” means any device connected to a network. Hosts can access the network using a “network adapter” or “network interface card” (NIC) which can physically connect a host to a given network. Some network adapters implement various types of network stack functionality in hardware, referred to herein as a “hardware network stack.” One example of a hardware network stack is a TCP Offload Engine or “TOE” that offloads processing of the entire TCP/IP stack to the NIC. This technology can reduce the number of central processing unit (“CPU”) cycles involved in network communications, and can also reduce traffic over a hardware bus such as a Peripheral Component Interconnect Express (“PCIe”) bus.
Another example of a hardware network stack provides remote direct memory access (“RDMA”) functionality to transfer data from pre-registered memory to the network, or from the network to pre-registered memory. The networking protocol is implemented on the network adapter, thus bypassing the operating system kernel. Hardware RDMA implementations can free CPU cores and accelerate communication-intensive workloads, such as storage and machine learning workloads. Example RDMA implementations include RDMA over Converged Ethernet (RoCE), RoCE version 2 (RoCEv2), and Internet Wide Area RDMA Protocol (iWARP), which runs RDMA over TCP/IP on the NIC.
Another example of a hardware network stack is Scalable Reliable Datagram (SRD), a customized transport protocol that sprays packets over multiple paths to minimize the chance of hotspots and uses delay information for congestion control. Unlike TCP and RDMA reliable connection (RC), SRD provides reliable but out-of-order delivery and leaves order restoration to applications.
Certain components of the devices shown in
Each of the devices in system 100 can include processing resources 101 and storage resources 102. As described more below, storage resources can be any type of device suited for volatile or persistent storage of code or data, e.g., RAM, SSD, optical drives, etc. Processing resources can include any type of device suitable for executing code and/or processing data. In some cases, processing resources can include general-purpose processors such as CPUs. In other cases, e.g., programmable network device 140, the processing resources can include specialized circuits such as an application-specific integrated circuit (“ASIC”). For instance, the programmable network device 140 can be a programmable switch and processing resources 101(4) can include an ASIC programmed using the P4 data forwarding language.
Orchestration server 110 includes an orchestrator 112 that configures network communications and event injections in the system. The orchestrator includes a configuration module 114 that configures the requesting server 120 and the responding server 130 to communicate traffic to one another through programmable network device 140. The configuration module also configures the programmable network device to inject events into the traffic, and to mirror the traffic to traffic pooling server 150 and traffic pooling server 160.
In some cases, the orchestrator 112 configures the programmable network device 140 by sending a configuration file to the programmable network device. The programmable network device can receive traffic sent from the requesting server 120 to the responding server 130, inject events having event parameters specified by the configuration file into the traffic using event injector 142, and forward the traffic to the responding server 130. The programmable network device can also receive responses from the responding server and forward them to the requesting server. The programmable network device can mirror all of the traffic to the traffic pooling server 150 and traffic pooling server 160. In some cases, the event injector configures the events by processing the configuration file received from the orchestrator to modify a match-action table 144.
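For purposes of illustration only, the following Python sketch shows one way an orchestrator-side helper might serialize high-level event intents into such a configuration file. The field names (event_type, qp_index, psn, and iter) are assumptions made for this example and do not represent any particular vendor's schema.

```python
# Hypothetical sketch: writing an event-injection configuration file.
# Field names are illustrative assumptions, not a defined schema.
import json

def write_event_config(path, events):
    """Serialize high-level event intents for the programmable network device."""
    with open(path, "w") as f:
        json.dump({"events": events}, f, indent=2)

if __name__ == "__main__":
    # Intent: "drop the first packet of the second QP, then drop its retransmission."
    events = [
        {"event_type": "drop", "qp_index": 2, "psn": 0, "iter": 1},
        {"event_type": "drop", "qp_index": 2, "psn": 0, "iter": 2},
    ]
    write_event_config("event_config.json", events)
```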
The requesting server 120 and the responding server 130 can have respective instances of a test network adapter 103(2) and 103(3). The test network adapters can have respective hardware stack instances 104(2) and 104(3), as described more below. The requesting server 120 can have a request generator 122 that generates requests and sends them using test network adapter 103(2) to the responding server 130 via programmable network device 140. The responding server 130 can have a response generator 132 that generates responses and sends them using test network adapter 103(3) to the requesting server 120 via programmable network device 140.
The traffic pooling server 150 and the traffic pooling server 160 can have respective instances of traffic pooling module 105(5) and 105(6), which can collectively store the traffic mirrored by the programmable network device 140. In some cases, the traffic pooling servers employ the same type of network adapter being tested on the requesting and responding servers.
The orchestrator 112 can have a data gathering module 116 that gathers the pooled traffic as well as other results, such as network stack counters, log files, and switch counters. The analysis module 118 can analyze the gathered data and output results of the analysis. As described more below, the analysis results can reflect network stack behavior implemented in hardware on the test network adapters 103.
The request generator 122 can communicate request traffic as instructed by the orchestrator, where at least some network stack functionality is offloaded from software to the hardware network stack 104(2). The response generator 132 can communicate response traffic as instructed by the orchestrator, where at least some network stack functionality is implemented by the hardware network stack 104(3).
The event injector 142 forwards the traffic via L2/L3 forwarding 204 and injects events 206, such as explicit congestion notification (ECN) marks, packet losses, and/or packet corruptions. The event injector also sends mirrored packets 208 to the traffic pooling modules 105 and maintains a traffic counter 210.
Once the traffic finishes, the orchestrator 112 collects results from the other components, e.g., mirrored packets, counters, log files, etc. The orchestrator reconstructs the complete packet trace from dumped packets collected by the traffic pooling modules 105 and can then parse the packet trace to analyze behaviors of the hardware network stack to output an analysis 212.
Before starting traffic generation by the request generator 122 and the response generator 132, the orchestrator 112 first configures IP addresses and network stack settings, e.g., congestion control and loss recovery parameters, of requesting server 120 and responding server 130.
After configuration, the orchestrator instructs the traffic generating hosts to initiate traffic generation. For instance, the traffic can be generated using Reliable Connected (RC) transport, supporting RDMA send, receive, write, and read verbs, denoted as SEND, RECV, WRITE and READ. The request generator 122 and response generator 132 communicate using one or multiple queue pairs (QPs). As shown in
After objects such as QPs and memory regions (MRs) are initialized by the hosts, they exchange metadata, e.g., QP number (QPN), packet sequence number (PSN), global identifier (GID), memory address and key, through a TCP connection. Since QPNs and PSNs are randomly generated at runtime and useful for the event injector to correctly identify packets, the hosts can send the metadata information to the event injector 142 as well.
After exchanging metadata and establishing QP connections, the request generator 122 posts work requests to generate RDMA traffic. The request generator controls the total number of requests/messages and the maximum number of outstanding requests on each QP. In the case of SEND/RECV, the response generator 132 continues posting corresponding RECV requests. The request generator can support barrier synchronization among QPs by posting the next round of requests only after receiving the completions of the current round of requests across all the QPs. Thereafter, when the request generator 122 receives completions of all the requests, the request generator can calculate metrics such as request/message completion times and total goodput, and send a completion notification to the response generator through the TCP connection.
To emulate realistic network scenarios like congestion and failures, the event injector 142 can be programmed via a configuration file to inject packet corruptions, packet drops and ECN marks to RDMA data packets. Generally, the response generator 132 can generate data packets for READ while the request generator 122 can generate data packets for the other verbs.
The disclosed implementations aim to provide user-friendly interfaces for engineers to express a series of deterministic injection events. To provide precise, reproducible results, the events can be specified deterministically: instead of randomly dropping a specified percentage of packets, an event such as “drop the first packet of the first QP” generates deterministic injection behavior. This allows a user to express high-level testing intents without the need to understand low-level details. For example, a user can configure the programmable network device to drop the first packet of the second QP and then drop the retransmission of this packet, without specifying the QPN and PSN for each QP or understanding how the event injector 142 identifies the retransmitted packet.
The programmable network device 140 can translate high-level test intents (e.g., relative QPN and PSN) expressed in the configuration file to the low-level configuration for event injection. An iteration number (ITER) can be employed to express per-connection retransmission behaviors.
As noted, a configuration file can provide high-level intent information such as relative PSN and QPN. In some implementations, the event injector 142 can detect new QPs and parse their QPNs and initial PSNs on the data plane. However, this stateful approach tends to complicate the data plane because, for every RDMA packet, the event injector first checks if it belongs to a new QP. If yes, the event injector 142 further needs to initialize states for this new QP.
Other implementations employ a stateless approach by leveraging traffic generators to provide runtime traffic metadata. As mentioned above, traffic metadata like QPN and initial PSN (IPSN) is randomly generated at runtime. Once the traffic generator instances finish exchanging metadata through TCP, the request generator 122 sends the complete traffic metadata to the event injector 142 through the control plane. The metadata is organized as a list of tuples. Each tuple contains the information for a certain QP connection: requester IP/QPN/IPSN and responder IP/QPN/IPSN. After that, the event injector combines the runtime traffic metadata from traffic generators and traffic configuration intents from the orchestrator 112 to populate the match-action table for event injections. After the event injector populates the table, the requesting server 120 and responding server 130 can start communicating RDMA traffic.
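A minimal Python sketch of this combination step is shown below, assuming requester-generated data packets (e.g., WRITE/SEND) and an illustrative in-memory representation of the metadata tuples and intents; the actual switch table interface is not modeled.

```python
# Illustrative sketch: translating high-level intents (relative QP index and
# relative PSN) plus runtime metadata (actual QPN and initial PSN) into concrete
# match-action entries. Data structures here are assumptions for the example.
PSN_MOD = 1 << 24  # RoCEv2 packet sequence numbers are 24-bit values

def build_match_entries(intents, connections):
    """connections: one dict per QP connection, in the order the QPs were created."""
    entries = []
    for intent in intents:
        conn = connections[intent["qp_index"] - 1]          # relative QP index -> connection
        entries.append({
            "match": {
                "dst_qpn": conn["responder_qpn"],            # data packets carry the responder QPN
                "psn": (conn["requester_ipsn"] + intent["psn"]) % PSN_MOD,
                "iter": intent["iter"],                      # transmission round (1 = original)
            },
            "action": intent["event_type"],                  # e.g., drop, ecn_mark, corrupt
        })
    return entries

connections = [
    {"responder_qpn": 0x11A3, "requester_ipsn": 0x3F0012},
    {"responder_qpn": 0x11A7, "requester_ipsn": 0x00FFFD},
]
intents = [{"event_type": "drop", "qp_index": 2, "psn": 5, "iter": 1}]
print(build_match_entries(intents, connections))
```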
In some cases, users wish to inject events into retransmitted packets to understand behaviors like retransmission timeout backoff. However, a retransmitted packet cannot necessarily be differentiated from the original packet by inspecting the RoCEv2 packet header, since both have the same IP addresses, UDP ports, QPN, and PSN.
Thus, some implementations employ an iteration number ITER, which denotes the rounds of transmissions for a connection. ITER starts from 1 and is maintained by the event injector 142. For every arriving packet, the event injector compares its PSN with the PSN of the last packet of its connection. If the PSN of the current packet is not larger than that of the previous packet, the event injector identifies this as a new round of transmissions and increases the ITER of this connection by 1. Regardless of the comparison result, the event injector always updates the PSN of the last packet of the connection using the PSN of the current packet. Thus, the pair (PSN, ITER) can be used to uniquely identify every packet in a connection prior to injecting events.
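The ITER bookkeeping just described can be summarized by the following short Python sketch; on the programmable network device this per-connection state would be held in data plane registers rather than a Python dictionary.

```python
# Minimal model of the per-connection iteration-number (ITER) logic described above.
class IterTracker:
    def __init__(self):
        self.state = {}  # connection id -> {"iter": int, "last_psn": int}

    def update(self, conn_id, psn):
        """Return the ITER value for the arriving packet."""
        st = self.state.get(conn_id)
        if st is None:
            st = {"iter": 1, "last_psn": psn}    # first packet of the connection
            self.state[conn_id] = st
            return st["iter"]
        if psn <= st["last_psn"]:
            st["iter"] += 1                      # PSN did not advance: new round of transmissions
        st["last_psn"] = psn                     # always remember the latest PSN
        return st["iter"]

tracker = IterTracker()
for psn in [0, 1, 2, 3, 1, 2, 3]:                # packets 1-3 retransmitted after a drop
    print(psn, tracker.update("qp-1", psn))
```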
Some implementations can dump all communicated packets between traffic generators to one or more pooling servers for offline analysis. One approach would be to employ tools like ibdump to dump packets at the end host. However, traffic dumping at the end host could plausibly impact behaviors of network stacks. Furthermore, reconstruction of the complete packet trace from packets dumped at both traffic generation hosts could involve nanosecond-level clock synchronization, which is non-trivial.
Thus, in some implementations, the event injector 142 is configured to mirror all the packets to a group of dedicated traffic pooling servers. Packet mirroring essentially clones packets of specified interfaces and forwards them to other interfaces for examination. For instance, all packets at the ingress pipeline can be mirrored before actually dropping any packets by the programmable network device 140. Ingress mirroring instead of egress mirroring allows capture of original behaviors of hardware network stacks.
For the ease of integrity check and traffic analysis, the event injector 142 can embed certain metadata in mirrored packets. To avoid losses during packet dumping, a per-packet load balancing mechanism can be employed to evenly distribute mirrored packets across CPU cores of traffic pooling servers 150 and 160.
In some implementations, the event injector 142 embeds three types of metadata in mirrored packets: a mirror sequence number, an event type, and a mirror timestamp, which serve the following purposes.
1. Integrity check. As mirrored packets are employed to understand behaviors of the hardware network stack, it is useful to ensure that all the packets are mirrored and stored by the traffic pooling servers. To this end, the event injector maintains a global variable, mirror sequence number, which is incremented for every arriving RDMA packet and embedded in each mirrored packet. Together with switch port counters, subsequent analysis can verify if there is any packet loss. If all the packets are dumped, the resulting trace should include consecutive mirror sequence numbers, and the largest mirror sequence number should match the total number of received packets.
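As one hedged illustration, the sequence-number portion of this check could look like the following sketch, assuming the mirror sequence number starts at 1 and the switch port counter reports the total number of received RDMA packets.

```python
# Sketch of the mirror-sequence-number integrity check: the dumped trace should
# contain consecutive sequence numbers, and the largest one should equal the
# number of RDMA packets counted at the switch ports.
def check_integrity(mirror_seqs, switch_rx_count):
    if not mirror_seqs:
        return False
    seqs = sorted(mirror_seqs)
    consecutive = seqs == list(range(seqs[0], seqs[0] + len(seqs)))
    complete = seqs[-1] == switch_rx_count       # assumes numbering starts at 1
    return consecutive and complete

# Example: five mirrored packets, switch counted five received RDMA packets.
print(check_integrity([1, 2, 3, 4, 5], switch_rx_count=5))   # True
print(check_integrity([1, 2, 4, 5], switch_rx_count=5))      # False (packet 3 is missing)
```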
2. Indicating events. For subsequent analysis of mirrored packets, the event injector 142 can include an event type in each packet to indicate the injected event, e.g., the event types can include ECN marking, packet drop, packet corruption, no event, and so on. Note that all packets can be mirrored at the ingress pipeline before packets are dropped.
3. Fine-grained measurement. To accurately measure behaviors of the hardware offloaded network stack, the event injector 142 can embed a mirror timestamp in each mirrored packet, which carries the nanosecond-level time when the original packet enters the ingress pipeline. Since the event injector adds timestamps to all the packets, it does not necessarily require clock synchronization.
To embed the above metadata in mirrored packets, one approach involves expanding packets with new fields storing these metadata. However, this may overload the bandwidth capacity of mirroring ports if the original traffic's throughput is close to line rate. To avoid this, the event injector can rewrite existing header fields that are not involved in traffic analysis to store the above metadata. In some implementations, the Time to Live (TTL) field, the source MAC address field, and the destination MAC address field are used to store the event type, mirror sequence number, and mirror timestamp, respectively.
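The following Python sketch models this field reuse on the host side, packing the event type into the 8-bit TTL, the mirror sequence number into the 48-bit source MAC, and the nanosecond timestamp into the 48-bit destination MAC. The numeric event encoding is an assumption for illustration.

```python
# Illustrative bit-packing of mirrored-packet metadata into existing header fields.
EVENT_CODES = {"none": 0, "ecn_mark": 1, "drop": 2, "corrupt": 3}  # assumed encoding

def pack_metadata(event, mirror_seq, ts_ns):
    ttl = EVENT_CODES[event] & 0xFF                           # 8-bit TTL carries the event type
    src_mac = (mirror_seq & ((1 << 48) - 1)).to_bytes(6, "big")   # 48-bit mirror sequence number
    dst_mac = (ts_ns & ((1 << 48) - 1)).to_bytes(6, "big")        # 48-bit nanosecond timestamp
    return ttl, src_mac, dst_mac

def unpack_metadata(ttl, src_mac, dst_mac):
    event = {code: name for name, code in EVENT_CODES.items()}[ttl]
    return event, int.from_bytes(src_mac, "big"), int.from_bytes(dst_mac, "big")

ttl, src, dst = pack_metadata("drop", mirror_seq=1234, ts_ns=987_654_321)
print(unpack_metadata(ttl, src, dst))   # ('drop', 1234, 987654321)
```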
One approach for pooling traffic involves using separate hosts to store mirrored packets generated by the requester and the responder, respectively. However, this approach may result in discarded packets when receiving line-rate mirrored packets. Although these invalid tests can be identified by integrity checks, this degrades efficiency. Thus, some implementations employ a per-packet load balancing mechanism to evenly distribute mirrored packets across CPU cores of multiple traffic pooling servers. The user can flexibly set up hosts as long as the total capacity of the pooling servers is sufficient to process bi-directional line-rate traffic with minimum-sized packets. The event injector 142 can implement a weighted round-robin load balancing scheme to forward mirrored packets to different traffic pooling servers based on their processing capacity. Though the requester and responder generate traffic at heterogeneous rates, the event injector can evenly distribute mirrored packets to homogeneous traffic pooling servers.
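A simple software model of the weighted round-robin distribution is sketched below; the server names and weights are example values, and on the switch the schedule would be realized with a counter and a lookup table rather than a Python generator.

```python
# Sketch of weighted round-robin assignment of mirrored packets to pooling
# servers. Equal weights reduce to plain round-robin for homogeneous servers;
# unequal weights would accommodate heterogeneous processing capacities.
from itertools import cycle

def build_schedule(weights):
    """weights: dict mapping server name -> integer share of capacity."""
    slots = [name for name, share in weights.items() for _ in range(share)]
    return cycle(slots)

schedule = build_schedule({"pool-150": 1, "pool-160": 1})   # example names and weights
for mirror_seq in range(1, 7):                              # first six mirrored packets
    print(mirror_seq, next(schedule))                       # alternates between the two servers
```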
At each traffic pooling server, Receive Side Scaling (RSS) can be employed to distribute packets across multiple CPU cores. However, RSS preserves flow-to-CPU affinity by hashing certain packet fields to select a CPU core. As a result, the CPU processing capacity depends on the number of flows in the test. To fully exploit CPU cores, the event injector 142 can be configured to rewrite the UDP destination port to a random number. Note that UDP destination port number 4791 is reserved for RoCEv2. By rewriting this port number, an illusion of many concurrent flows is presented to RSS, thus efficiently leveraging CPU processing capacity.
Once traffic generators stop, the orchestrator 112 terminates all the other components and collects various result files shown in result table 700, shown
Upon collecting all the result files, the orchestrator 112 reconstructs the packet trace from packets collected by the traffic pooling servers. Since the event injector 142 maintains the mirror sequence number and stores this on the source MAC address field of each mirrored packet, the orchestrator can sort all the packets based on their mirror sequence numbers.
After the packet trace reconstruction, the orchestrator 112 runs an integrity check using the following four conditions to determine if the packet trace is complete, without loss of any packets during traffic mirroring and dumping:
If these conditions hold, a complete traffic trace has been constructed. Otherwise, an error can be reported to convey that the test data is invalid and that analysis based on the collected test data is unlikely to be accurate.
Once a given test passes these integrity checks, reconstructed packet traces, log files, and counters can be analyzed. In some cases, the analysis module 118 of orchestrator 112 provides a set of built-in analyzers for certain features in RDMA, e.g., Go-back-N retransmission and ECN-based congestion control.
Retransmission logic. Retransmission can be tested to ensure reliable delivery. In some implementations, a retransmission logic analyzer is employed to check if the network adapter follows the corresponding specification when a packet is dropped. For instance, a network adapter that implements the Go-back-N specification should generate a negative-acknowledgement (“NACK”) packet when it receives out-of-order packets. To realize this, the specification of Go-back-N can be translated into a finite-state machine (FSM) and the reconstructed packet trace can be fed into this FSM. If the packet trace is not accepted by the FSM, the network adapter's retransmission implementation does not fully comply with the specification.
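A deliberately simplified sketch of one property such an FSM-based check can encode is shown below: after an out-of-order data packet, the responder should emit a NACK carrying the next expected PSN. The trace record layout is an assumption for this example; a production analyzer would operate on the reconstructed packet trace and cover the full specification.

```python
# Simplified checker for one Go-back-N property over an assumed trace format.
def check_go_back_n(trace, first_psn=0):
    expected = first_psn
    awaiting_nack = False
    for pkt in trace:
        if pkt["dir"] == "req->resp" and pkt["kind"] == "data":
            if pkt["psn"] == expected:
                expected += 1
                awaiting_nack = False
            else:
                awaiting_nack = True             # out-of-order arrival observed
        elif pkt["dir"] == "resp->req" and pkt["kind"] == "nack":
            if awaiting_nack and pkt["psn"] != expected:
                return False                     # NACK does not carry the expected PSN
            awaiting_nack = False
    return not awaiting_nack                     # every observed gap was answered by a NACK

trace = [
    {"dir": "req->resp", "kind": "data", "psn": 0},
    {"dir": "req->resp", "kind": "data", "psn": 2},   # PSN 1 was dropped
    {"dir": "resp->req", "kind": "nack", "psn": 1},
    {"dir": "req->resp", "kind": "data", "psn": 1},
    {"dir": "req->resp", "kind": "data", "psn": 2},
]
print(check_go_back_n(trace))   # True
```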
Retransmission performance. In some cases, RDMA is employed over lossy networks. Lossy RDMA technologies heavily rely on efficient retransmission implementations. For example, when a network adapter receives a NACK packet or a selective acknowledgement (“SACK”) packet, ideally the network adapter will start the retransmission immediately, rather than wait for a long time.
To analyze retransmission performance of a given network adapter, a retransmission performance analyzer can be employed. Note that this tool can be used in combination with the above retransmission logic analyzer to determine if the network adapter under test has a correct and efficient retransmission implementation. The retransmission performance analyzer can deal with both fast retransmissions (triggered by NACK/SACK) and timeout retransmissions (due to tail losses), and provide a performance breakdown to help users identify the bottleneck.
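The sketch below illustrates one way such a breakdown might be computed from the nanosecond mirror timestamps embedded by the event injector, splitting the recovery time into NACK generation latency (first packet received out of order to the NACK) and NACK reaction latency (NACK to the retransmitted packet). The trace record layout and the exact breakdown points are assumptions made for this example.

```python
# Sketch of a retransmission-performance breakdown over an assumed trace format.
def retransmission_breakdown(trace):
    """trace: time-ordered mirrored packets with embedded event-type metadata."""
    drop_psn = t_ooo = t_nack = t_retx = None
    for pkt in trace:
        if drop_psn is None:
            if pkt.get("event") == "drop":
                drop_psn = pkt["psn"]               # packet removed by the injector
        elif t_ooo is None:
            if pkt["kind"] == "data":
                t_ooo = pkt["ts_ns"]                # first packet the receiver sees out of order
        elif t_nack is None:
            if pkt["kind"] == "nack":
                t_nack = pkt["ts_ns"]
        elif t_retx is None:
            if pkt["kind"] == "data" and pkt["psn"] == drop_psn:
                t_retx = pkt["ts_ns"]               # retransmission of the dropped packet
    return {"nack_generation_ns": t_nack - t_ooo,
            "nack_reaction_ns": t_retx - t_nack}

trace = [
    {"kind": "data", "psn": 0, "ts_ns": 1_000, "event": "none"},
    {"kind": "data", "psn": 1, "ts_ns": 1_500, "event": "drop"},   # mirrored, then dropped
    {"kind": "data", "psn": 2, "ts_ns": 2_000, "event": "none"},
    {"kind": "nack", "psn": 1, "ts_ns": 5_000, "event": "none"},
    {"kind": "data", "psn": 1, "ts_ns": 9_000, "event": "none"},   # retransmission
]
print(retransmission_breakdown(trace))
# {'nack_generation_ns': 3000, 'nack_reaction_ns': 4000}
```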
Congestion notification. Data center quantized congestion notification (“DCQCN”) is the de facto RoCEv2 congestion control protocol implemented in certain network adapters. Once the DCQCN notification point (NP, receiver) receives ECN-marked packets, the receiver notifies the reaction point (RP, sender) to reduce the rate using Congestion Notification Packets (CNPs). Recent network adapter models extend DCQCN to lossy networks. When the notification point receives out-of-order packets, it generates both NACKs and CNPs to notify the reaction point to start the retransmission and lower the sending rate. In addition, to reduce the volume of CNP traffic and the CNP processing overhead, some network adapters also have a CNP pacer at the notification point side, which determines the minimum interval between two consecutive generated CNPs.
In summary, the generation of CNPs depends on ECN-marked packets, packet losses, and the CNP pacer configuration. In some cases, the orchestrator 112 provides a CNP analyzer to check if CNPs are generated as expected under various network conditions and CNP pacer configurations.
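As one simplified illustration, a CNP analyzer might verify that consecutive CNPs destined for a given requester are spaced by at least the configured pacer interval. The record layout and grouping key below are assumptions, and choosing the grouping key (per destination IP versus per port) is itself one of the questions such an analysis can answer.

```python
# Sketch: flag CNPs that arrive sooner than a configured pacer interval.
def check_cnp_spacing(cnps, min_interval_ns, key="dst_ip"):
    """cnps: time-ordered CNP records with 'ts_ns' and a grouping field."""
    last_seen = {}
    violations = []
    for cnp in cnps:
        group = cnp[key]
        if group in last_seen and cnp["ts_ns"] - last_seen[group] < min_interval_ns:
            violations.append(cnp)
        last_seen[group] = cnp["ts_ns"]
    return violations

cnps = [
    {"dst_ip": "10.0.0.11", "ts_ns": 0},
    {"dst_ip": "10.0.0.12", "ts_ns": 10_000},
    {"dst_ip": "10.0.0.11", "ts_ns": 30_000},   # only 30 us after the previous CNP to .11
]
print(check_cnp_spacing(cnps, min_interval_ns=50_000))   # flags the third CNP
```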
Hardware network stack counter. Some implementations also employ a counter analyzer to check if counters of the hardware network stack are updated correctly. Counters related to retransmission, timeout, congestion, and packet corruption can be checked, e.g., counters of sent/received packets, sequence errors, out-of-sequence packets, timeouts (and retries), packets with redundancy coding errors, discarded packets, CNPs sent/handled, etc.
In some implementations, the event injector 142 modifies packets and sets a drop flag at the ingress pipeline. The events are injected by manipulating the packet field or intrinsic metadata. The egress pipeline includes a module to rewrite packet fields of mirrored packets. Both incoming and outgoing packets, including mirrored packets, are tracked on each port for integrity checks. The switch control plane can be coded to translate RPC calls to configure the data plane modules and dump port counters after a given experiment finishes.
In some implementations, the traffic generator hosts employ Libibverbs to generate RDMA traffic over RC transport. A traffic generator can control the IP associated with each QP to emulate traffic from multiple hosts. A traffic generator can also report total goodput and average request/message completion times for each QP.
The traffic pooling hosts can be configured using data plane development kit (“DPDK”) and Receive Side Scaling (RSS) to dispatch the packets among the receiver queues and cores. The traffic pooling hosts can buffer packets in the pre-allocated memory during a given test and write the packets to storage upon receiving a TERM message from the orchestrator.
The disclosed implementations were employed to conduct experiments on three commercially-available network adapters, referred to below as NIC A, NIC B, and NIC C. Each test was conducted using a total of four servers connected to a network switch with a programmable ASIC, which served as the event injector. Each server had a multi-core CPU and a corresponding NIC running a Linux-based operating system. Two traffic generating servers were used to generate traffic, and two traffic pooling servers were used to dump mirrored packets.
Overhead of event injection and mirroring. An experiment was conducted to test the overhead of event injection and traffic mirroring. The traffic generator was configured to send 1000 messages of a fixed size over a single QP, and the average message completion time (MCT) was measured. The messages were sent back-to-back, and the experiment was conducted with different message sizes: 1 KB, 10 KB and 100 KB.
As shown in
Benefit of per-packet load balancing. The efficiency of the disclosed techniques depends to some extent on the reliability of packet dumping.
The experiment was run for 100 rounds, and the ratio of rounds that passed the integrity check was measured. As shown in
When packets in the middle are dropped, the receiver can observe out-of-order packets and generate NACK or SACK to trigger fast retransmissions. NICs A, B, and C adopt Go-back-N as the default fast retransmission algorithm. The disclosed techniques were used to evaluate fast retransmission behaviors of these NICs by deliberately dropping packets. All of the network adapter models passed the FSM-based retransmission logic check described previously, indicating that their retransmission implementations follow the specification.
Setting. In this experiment, a traffic generator uses one connection to generate WRITE traffic with only a single outstanding request. For each message, one packet was dropped with a different (relative) PSN. The message size was fixed at 20 KB and 100 KB in separate runs. The experiment was run for 1000 iterations and the average latency was computed. The Go-back-N retransmission latency was broken down into two parts: the NACK generation latency and the NACK reaction latency.
Performance improvement. As shown in
Retransmission might be blocked. While NICs B and C deliver low NACK reaction latency, the NACK reaction latency still varies. As shown in
Another set of experiments was conducted to further investigate this behavior. This time, the effect of message size was investigated by dropping the fifth packet of a message and varying the message size from 10 KB to 200 KB. The packets are sent back-to-back.
One plausible explanation for this anomaly: retransmitted packets and original packets share the same transmission pipeline on the NIC, and retransmitted packets cannot preempt original packets that are already in the pipeline. As a result, retransmitted packets may be delayed. When the sender transmits a packet, it may push the packet to the tail of the pipeline regardless of whether it is a retransmitted packet or a normal packet. If the message is short (e.g., 10 KB), the pipeline is already empty when the retransmission happens because all the packets have been transmitted. If the message is relatively large (e.g., 100 KB), however, the pipeline might still retain a few packets that have not yet been sent when a packet needs to be retransmitted. This is one example of an analytical inference that a user can make from experiments conducted using the disclosed techniques.
When tail packets or retransmitted packets are dropped, the sender can only use the retransmission timer to recover them. Inappropriate timeout values can lead to either spurious retransmissions or poor tail performance. In this section, findings related to timeout retransmissions are reported.
Setting. When creating QPs, Libibverbs provides an interface to configure the timeout and retry_cnt value. Default values were employed. timeout is set to 14, meaning that the minimum timeout is 4.096 μs * 2^timeout = 0.0671 s [45]. retry_cnt indicates the maximum number of times that the QP will try to resend the packets before reporting an error, and was set to 7.
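For reference, the arithmetic behind the 0.0671 s figure can be reproduced as follows, since the Libibverbs timeout field is an exponent rather than a duration.

```python
# Quick check of the timeout arithmetic above: timeout=14 yields roughly 67.1 ms.
timeout_exponent = 14
min_timeout_s = 4.096e-6 * (2 ** timeout_exponent)
print(f"{min_timeout_s:.4f} s")   # 0.0671 s
```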
In the experiment, the programmable network device continues dropping the tail packet to trigger timeouts. Specifically, one connection is used to send 5 WRITE messages. For each message, the size is 10 KB, and the tenth (tail) packet is dropped 7 (retry_cnt) times. Experiments were run in both adaptive and non-adaptive (default) retransmission modes. The results are shown in
Unexpected timeout value. When adaptive retransmission is enabled, the actual timeout value changes according to the packet loss frequency. It is worth noting that except for the first message, the first timeout of a message jumps to a high level unexpectedly (e.g., 0.267 s for the second message and 0.671 s for the latter messages). Furthermore, the timeout value is not bounded below by the pre-configured minimum of 0.0671 s: for the first message, the observed retransmission timeouts were 0.0056 s, 0.0041 s, 0.0084 s, 0.0167 s, 0.0251 s, and 0.1342 s, several of which are smaller than 0.0671 s. For non-adaptive retransmission, the timeout value obeys the specification: the first timeout is around 0.2-0.4 s, and the following timeouts are static at about 0.537 s. All these values are larger than the minimum timeout of 0.0671 s.
Retry times. Experiments also revealed that the maximum number of retries is correctly enforced in non-adaptive mode, but not necessarily in adaptive mode. Five WRITE messages were sent while dropping the last packet of each message until an error was reported.
Another experiment illustrates congestion notification packet (CNP) generation, which is employed for congestion control. The following experiments were conducted to determine whether CNP generation of one connection would be affected by ECNs or losses from another connection.
Setting. This experiment was conducted using three connections, each sending a 1 MB WRITE message. All three connections have the same responder address (10.0.0.1) but different requester addresses (10.0.0.11, 10.0.0.12, and 10.0.0.13) to simulate a scenario in which three requesters send WRITE traffic to one responder. The QPs at the responder side are denoted as QP1, QP2 and QP3, respectively. The network adapters employed use a parameter named min_time_between_cnps to control the CNP generation interval. In this experiment, min_time_between_cnps was set to 50 μs. The three connections send traffic simultaneously, and ECN was marked on the 50th packet and the 950th packet of each connection. The DCQCN reaction functionality in the requester was disabled to avoid adjusting the sending rate upon receiving CNPs. After the experiment, the mirrored traffic was analyzed to determine how many CNPs each (sender) QP receives.
Per-port interval or per-dstIP interval. As shown in
The experiments shown above are just a few examples of how the disclosed implementations allow users to set up and conduct experiments to analyze network stack behavior implemented in hardware. As discussed previously, it is difficult for network engineers to fully evaluate the behavior of a network stack that is implemented in hardware using conventional approaches. While hardware vendors provide documents that describe the specifications of a given network device, there are often certain behaviors that are not fully detailed in the documentation, or the documentation can be incomplete or even incorrect.
For software network stack behavior, it is relatively easy to use code such as a “shim layer” to configure tests of the network stack. On the other hand, it is far more difficult for network engineers to configure similar tests for hardware. From the perspective of an end user, a network adapter is a “black box” and network engineers cannot readily modify behavior of the network adapter.
By using a programmable network device as described herein to inject events into traffic and mirror the traffic for subsequent analysis, it is possible for network engineers to specify individual tests that they would like to conduct. The use of a configuration file or GUI to specify event parameters for testing enables the network engineers to have fine control over which behavior they would like to test. Thus, users are able to conduct very precise, deterministic tests of hardware network stack functionality without modifying the hardware itself.
Prior approaches for testing hardware stack behavior, such as RDMA implementations, typically involve running synthetic workloads to measure end-to-end performance in testbeds and test clusters. This approach can reveal certain functional bugs, but may not accurately capture micro-behaviors like per-packet transmission time.
In addition, hardware network stacks provide high throughput and ultra-low latency. Thus, testing tools to accurately test hardware network stacks under realistic conditions should be able to do so at high speed with low extra delay. In the disclosed implementations, multiple traffic pooling servers are provided to handle full line-rate mirrored traffic. In some cases, the traffic pooling servers employ the same network adapter as the hosts being tested.
The use of a programmable network device to inject the events and mirror the traffic also allows the events to be injected with low latency. As shown above, the mirroring can be performed using a load-balancing approach that allows for full line-rate testing while injecting metadata such as sequence number, timestamp, and event type into the mirrored packets for subsequent analysis.
Method 1600 begins at block 1602, where event parameters are received. For instance, the event parameters can be received by user input editing a configuration file or via a configuration graphical user interface. The event parameters can specify event types such as packet corruptions, packet drops, or explicit congestion notifications.
Method 1600 continues at block 1604, where a programmable network device is configured to inject events based on the event parameters. For instance, the programmable network device can be configured by sending a configuration file with the event parameters to the programmable network device. The configuration file can cause the programmable network device to inject the events into network traffic communicated between two or more hosts having network adapters that perform network stack functionality in hardware.
Method 1600 continues at block 1606, where mirrored traffic provided by the programmable network device is obtained. The mirrored traffic can be obtained from one or more devices, such as traffic pooling servers 150 and 160.
Method 1600 continues at block 1608, where the mirrored traffic is analyzed to obtain analysis results reflecting behavior of the network stack functionality of the network adapters. For instance, behavior such as retransmission logic, retransmission latency, timeout values, retry counts, and congestion notification behavior can be analyzed using the mirrored traffic.
Method 1600 continues at block 1610, where analysis results are output. For instance, the analysis results can be written to a file in storage or displayed via one or more graphical user interfaces as shown in any of
Blocks 1602 and 1604 can be performed by the configuration module 114 of the orchestrator 112. Block 1606 can be performed by the data gathering module 116 of the orchestrator. Blocks 1608 and 1610 can be performed by the analysis module 118 of the orchestrator.
Method 1700 begins at block 1702, where event parameters are received. For instance, the event parameters can be received in a configuration file.
Method 1700 continues at block 1704, where events are injected into traffic communicated between two or more network adapters connected to two or more ports. For instance, the events can be injected by parsing the configuration file to extract the event parameters and configuring a match-action table according to the extracted event parameters.
Method 1700 continues at block 1706, where the traffic is distributed. For instance, the traffic with the events injected thereto can be sent out on a particular port based on a destination hardware address of a particular network adapter that is connected to that particular port.
Method 1700 continues at block 1708, where the traffic is mirrored to one or more other devices. For instance, the other devices can include one or more traffic pooling servers.
In some cases, the configuration module 114 of the orchestrator 112 can provide a graphical user interface that allows automated generation of configuration files.
Users can add new events to a configuration file by selecting add event element 1805. When finished adding events, the user can select submit configuration element 1806, which causes the configuration module to generate a configuration file specifying the events. The configuration module can configure the programmable network device, traffic generating hosts, and traffic pooling servers for subsequent testing based on the configuration file.
As noted above with respect to
The terms “device,” “computer,” “computing device,” “client device,” and/or “server device” as used herein can mean any type of device that has some amount of hardware processing capability and/or hardware storage/memory capability. Processing capability can be provided by one or more hardware processors (e.g., hardware processing units/cores) that can execute computer-readable instructions to provide functionality. Computer-readable instructions and/or data can be stored on storage, such as storage/memory and/or the datastore. The term “system” as used herein can refer to a single device, multiple devices, etc.
Storage resources can be internal or external to the respective devices with which they are associated. The storage resources can include any one or more of volatile or non-volatile memory, hard drives, flash storage devices, and/or optical storage devices (e.g., CDs, DVDs, etc.), among others. As used herein, the term “computer-readable medium” can include signals. In contrast, the term “computer-readable storage medium” excludes signals. Computer-readable storage media includes “computer-readable storage devices.” Examples of computer-readable storage devices include volatile storage media, such as RAM, and non-volatile storage media, such as hard drives, optical discs, and flash memory, among others.
In some cases, the devices are configured with a general-purpose hardware processor and storage resources. In other cases, a device can include a system on a chip (SOC) type design. In SOC design implementations, functionality provided by the device can be integrated on a single SOC or multiple coupled SOCs. One or more associated processors can be configured to coordinate with shared resources, such as memory, storage, etc., and/or one or more dedicated resources, such as hardware blocks configured to perform certain specific functionality. Thus, the term “processor,” “hardware processor” or “hardware processing unit” as used herein can also refer to central processing units (CPUs), graphical processing units (GPUs), neural processing units (NPUs), controllers, microcontrollers, processor cores, or other types of processing devices suitable for implementation both in conventional computing architectures as well as SOC designs.
Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
In some configurations, any of the modules/code discussed herein can be implemented in software, hardware, and/or firmware. In any case, the modules/code can be provided during manufacture of the device or by an intermediary that prepares the device for sale to the end user. In other instances, the end user may install these modules/code later, such as by downloading executable code and installing the executable code on the corresponding device.
Also note that devices generally can have input and/or output functionality. For example, computing devices can have various input mechanisms such as keyboards, mice, touchpads, voice recognition, gesture recognition (e.g., using depth cameras such as stereoscopic or time-of-flight camera systems, infrared camera systems, or RGB camera systems, or using accelerometers/gyroscopes), facial recognition, etc. Devices can also have various output mechanisms such as printers, monitors, etc.
Also note that the devices described herein can function in a stand-alone or cooperative manner to implement the described techniques. For example, the methods and functionality described herein can be performed on a single computing device and/or distributed across multiple computing devices that communicate over network(s) 650. Without limitation, network(s) 650 can include one or more local area networks (LANs), wide area networks (WANs), the Internet, and the like.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims, and other features and acts that would be recognized by one skilled in the art are intended to be within the scope of the claims.
Various examples are described above. Additional examples are described below. One example includes a method comprising configuring a programmable network device to inject events into network traffic communicated between two or more hosts having network adapters that perform network stack functionality in hardware, obtaining mirrored traffic provided by the programmable network device, the mirrored traffic including the injected events, analyzing the mirrored traffic to obtain analysis results, the analysis results reflecting behavior of the network stack functionality in response to the injected events, and outputting the analysis results.
Another example can include any of the above and/or below examples where the injected events comprise packet corruptions.
Another example can include any of the above and/or below examples where the injected events comprise packet drops.
Another example can include any of the above and/or below examples where the injected events comprise explicit congestion notification marks.
Another example can include any of the above and/or below examples where the method further comprises populating a configuration file with event parameters for the events and sending the configuration file to the programmable network device, the programmable network device being configured to read the configuration file and inject the events according to the event parameters.
Another example can include any of the above and/or below examples where the configuration file identifies a particular queue pair for which the programmable network device is to inject a particular event.
Another example can include any of the above and/or below examples where the configuration file identifies a particular packet sequence number of a particular packet in which to inject the particular event.
Another example can include any of the above and/or below examples where the configuration file identifies a particular event type of the particular event.
Another example can include any of the above and/or below examples where the configuration file identifies an iteration number specifying a transmission round in which the particular event is to be injected by the programmable network device.
Another example can include any of the above and/or below examples where the analysis results reflect retransmission latency.
Another example can include any of the above and/or below examples where the analysis results reflect timeout values and retry counts.
Another example can include any of the above and/or below examples where the analysis results reflect congestion notification behavior.
Another example includes a programmable network device comprising a plurality of ports, and a programmable logic circuit configured to receive event parameters of events to inject into network traffic communicated between two or more network adapters connected to two or more of the ports, inject the events into the network traffic, distribute the network traffic having the injected events between the two or more network adapters, and mirror the network traffic to one or more other devices for subsequent analysis.
Another example can include any of the above and/or below examples where the programmable logic circuit is configured to populate a match-action table to inject the events based at least on the event parameters.
Another example can include any of the above and/or below examples where the injected events include at least one of a packet corruption event, a packet drop event, or an explicit congestion notification event.
Another example can include any of the above and/or below examples where the programmable logic circuit is configured to inject the events for a particular queue pair specified by the event parameters.
Another example can include any of the above and/or below examples where the event parameters are received in a configuration file and the programmable logic circuit is configured to parse the configuration file to extract the event parameters from the configuration file.
Another example includes a system comprising a processor, and a storage medium storing instructions which, when executed by the processor, cause the system to configure a programmable network device to inject events into network traffic communicated between two or more hosts having network adapters that perform network stack functionality in hardware, obtain mirrored traffic provided by the programmable network device, the mirrored traffic including the injected events, analyze the mirrored traffic to obtain analysis results, the analysis results reflecting behavior of the network stack functionality in response to the injected events, and output the analysis results.
Another example can include any of the above and/or below examples where the outputting of the results comprises displaying one or more graphical user interfaces that convey at least one of retransmission latency, timeout values, retry counts, or congestion notification behavior.
Another example can include any of the above and/or below examples where the instructions, when executed by the processor, cause the system to display a graphical user interface having elements for specifying event parameters, receive user input directed to the elements of the graphical user interface, the user input identifying particular event parameters for particular events, and configure the programmable network device to inject the particular events according to the particular event parameters.