Programmable data plane event processing systems can be implemented using a variety of devices such as general-purpose processors, field-programmable gate arrays (FPGAs), and domain-specific event processing application-specific integrated circuit (ASIC) designs. Programmable data plane event processors can be used to build network packet processing systems that operate at or near line rate (e.g., an upper rate of egress of packets from a network interface device). In order to avoid possible read or write hazards, some programmable packet processing systems implement read-modify-write operations atomically per-packet (e.g., within a single clock cycle) to perform simple stateful packet header transformations, which can limit the scope of applicable stateful packet processing algorithms.
Various examples described herein include a programmable data plane event processing architecture that can perform stateless or stateful operations. Various examples include a programmable packet or event processing pipeline that performs stateful operations such as multi-instruction or multiple arithmetic logic unit (ALU) operations over multiple clock cycles. One or more packet processing units (PPUs) and/or event processing units (EPUs) of the programmable architecture can include a programmable engine that is capable of performing read-modify-write operations on a set of state variables. One or more EPUs can at least execute very long instruction word (VLIW) instructions to cause processing of an event's metadata fields in series or in parallel.
One or more EPUs can perform stateful event processing on one or more of: global state or flow state. Global state (e.g., global connection state) can be shared across flows and must be updated atomically per-event whereas flow state can be updated atomically between events belonging to the same flow. A flow or group can represent a particular grouping of data plane events. The flow ID (or group ID) can be determined by a subset of the event metadata fields.
State can include per-connection information for reliability and congestion control (e.g., packet sequence numbers). State can include telemetry data, security data, and metadata for outstanding packets (e.g., transmitted packets for which acknowledgement of receipt has not been received (not ACKd)). One or more EPUs can perform multiple ALU operations per state update. Event metadata and/or memory data can be updated by each EPU stage. Memory data can include flow state or per-packet state.
Some examples provide a programmable architecture consisting of one or more EPUs. An EPU can perform read-modify-write operations on a set of state variables. At least one EPU can process 1 event per clock cycle. An EPU may utilize one or more programmable compute engines to execute VLIW instructions in order to process multiple event metadata fields in parallel. One or more programmable compute engines may be integrated into an EPU or programmable compute engines may be assigned to each EPU from a disaggregated resource pool at compilation time of an event processing program.
Static random access memory (SRAM) and content addressable memory (CAM) resources may either be integrated into each EPU or may be allocated to each EPU from a disaggregated resource pool at compilation time of an event processing program. These memory resources may be utilized as an on-chip cache backed by off-chip memory.
For example, when a packet of a first flow experiences a cache miss and a packet of a second flow experiences a cache hit, processing of the packet of the second flow (the flow with the cache hit) can proceed while processing of the packet of the first flow may stall. The programmable pipeline can assign a packet to an ordering domain to enforce ordering between packets within a same flow but allow packets of different flows to bypass at least one packet of a different flow.
The EPU may utilize primitives to provide support for programmable operations on data structures such as linked lists, doubly linked lists, tree structures, and exact match tables. An exact match table can be used to store connection state such as counters, pointers for per-connection data structures, and so forth. The primitives can be used to manipulate data structures: (1) perform memory access pattern for data structures (e.g., two sequentially dependent memory reads followed by an update to the first address), (2) free lists to implement memory allocation and deallocation, and/or (3) compute primitives which can be used to manipulate data structure pointers.
In some examples, linked lists can be used to implement per-flow queues. Nodes used to build the linked list can be dynamically allocated as needed, and a free list can be used to manage available memory handles.
Developers can generate programs that are executed by the pipeline to perform read-modify-write operations on global or general state information and to perform read-modify-write operations on flow state. One or more PPUs or EPUs can perform arithmetic and logical operations that can be composed together. One or more PPUs or EPUs can perform programmable operations on data structures such as linked lists and exact match tables. Exact match tables can support data insertions, deletions, and lookup operations and a programmer can construct linked lists and express operations on the linked lists.
The programmable data plane event processor can be integrated in a packet processing or event processing device such as a network interface device for programmability of data center transport protocols and for gathering and processing network telemetry metrics.
For example, an event processor can issue memory accesses to a memory pool (e.g., CAM and/or SRAM pool), package the accessed connection context with event data (e.g., header or metadata about a packet, such as a connection ID), indicate to an ALU pipeline or pool which program to run, and provide the connection context with the event data to the ALU pipeline or pool for processing. An event processor can identify events that might access the same state (e.g., the same connection ID), complete processing of multiple packets that access the same state in order, queue events of the same connection ID to enforce order, and separately allow parallel processing of packets of other connection IDs. An event processor can enforce memory access patterns to allow multiple packets with different connection IDs to access different state and be processed in parallel, or to provide dependent handling of packets of the same connection ID using free lists or global counters (e.g., resource counters). For example, an event can correspond to one or more of: packet arrival, a packet to be transmitted, timer expiration (e.g., retransmit timer), a packet coalescing timer, or a queue being next to be scheduled. An EPU can also generate events that control cache content (e.g., evict or load). A programmable stateful data plane can be programmed using an event graph description, with event handling executed on different EPUs with parallel access to compute resources and memory resources. Hardware can be allocated to handle memory access patterns scheduled based on connection ID, so that state is updated before the next event for that connection ID that might modify the same state is handled. Such patterns can include reading an entry and writing it back, a first read followed by a read dependent on results of the first read, an exact match lookup, or others. Programmable compute can be programmed independently from memory access.
A Cloud Service Provider (CSP) or Communication Service Provider (CoSP) can utilize the programmability and performance of the architecture to implement network transport protocols and/or congestion control for a tenant and its services (e.g., one or more processes, applications, virtual machines (VMs), containers, microservices, and so forth).
For received packets, classification 102 can identify a packet's flow ID and issue a command to cache manager 110 to prefetch flow state at a start of a pipeline of processing the packet. Classification 102 can stall processing of the packet until the corresponding flow state is loaded into caches by cache manager 110. Flow state can be accessed for packet processing in subsequent pipeline stages (e.g., one or more of PPUs 104-0 to 104-N). Classification 102 can assign the packet to an ordering domain and associated ordering queue 112 by hashing the flow ID. Classification 102 can access an exact match table to access global state such as pointer to connection state for a connection, per-packet state, counters, and so forth.
A flow can represent a sequence of packets being transferred between two endpoints, generally representing a single session using a known protocol. Accordingly, a flow can be identified by a set of defined tuples and, for routing purpose, a flow is identified by the two tuples that identify the endpoints, e.g., the source and destination addresses. For content-based services (e.g., load balancer, firewall, intrusion detection system, etc.), flows can be differentiated at a finer granularity by using N-tuples (e.g., source address, destination address, IP protocol, transport layer source port, and destination port). A packet in a flow is expected to have the same set of tuples in the packet header. A packet flow to be controlled can be identified by a combination of tuples (e.g., Ethernet type field, source and/or destination IP address, source and/or destination User Datagram Protocol (UDP) ports, source/destination TCP ports, or any other header field) and a unique source and destination queue pair (QP) number or identifier. A packet may be used herein to refer to various formatted collections of bits that may be sent across a network, such as Ethernet frames, IP packets, TCP segments, UDP datagrams, etc. Also, as used in this document, references to L2, L3, L4, and L7 layers (layer 2, layer 3, layer 4, and layer 7) are references respectively to the second data link layer, the third network layer, the fourth transport layer, and the seventh application layer of the OSI (Open System Interconnection) layer model.
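As a non-limiting illustration of N-tuple flow identification, the following Python sketch derives a flow ID from a 5-tuple. The `FiveTuple` type, the `flow_id` function, and the table size are hypothetical names and values chosen for illustration, and CRC32 is used only as a convenient stand-in hash; the description above does not specify a particular hash function.

```python
import zlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """Header fields that identify a flow at 5-tuple granularity."""
    src_addr: str
    dst_addr: str
    protocol: int   # IP protocol number, e.g., 6 for TCP, 17 for UDP
    src_port: int
    dst_port: int


def flow_id(t: FiveTuple, table_size: int = 1 << 16) -> int:
    """Map a 5-tuple to a flow ID in [0, table_size). Hash choice is an assumption."""
    key = f"{t.src_addr}|{t.dst_addr}|{t.protocol}|{t.src_port}|{t.dst_port}"
    return zlib.crc32(key.encode()) % table_size


# Packets carrying the same tuple values map to the same flow ID.
a = FiveTuple("10.0.0.1", "10.0.0.2", 6, 40000, 443)
b = FiveTuple("10.0.0.1", "10.0.0.2", 6, 40000, 443)
c = FiveTuple("10.0.0.1", "10.0.0.2", 17, 40000, 443)
```

Because the flow ID is a hash, two distinct tuples can collide; an exact match table keyed on the full tuple could be used where collisions are not acceptable.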
After classification 102, packets can be processed by a pipeline of one or more PPUs 104-0 to 104-N. A PPU can access flow state from cache manager 110. Read stages (e.g., RD0 or RD1) can perform dependent reads from cache manager 110. For example, a head pointer can be read and then the first entry in a list can be read based on the head pointer. Sequential read stages can perform linked list pop operations (as described herein).
PPUs 104-0 to 104-N can include respective ordering domain queues 112-0 to 112-N that can be allocated to one or more ordering domains. Packets of flows can be mapped to queues 112-0 to 112-N. Queues 112-0 to 112-N can be used to preserve ordering of packets of a flow. Packets can be processed by different stages of PPUs and stored in queues 112-0 to 112-N. A packet can be stored in a queue until a cache is filled with the packet's flow state. Processing of the packet can be stalled in case of a cache miss of flow state. Ordering domain queues 112-0 to 112-N can be used to control packets of a same flow to be processed in first in first out (FIFO) order and to enforce the time spacing between packets of the same flow. Packets of a flow can head-of-line block packets of another flow that maps to the same ordering domain queue. Hence, use of ordering domain queues 112-0 to 112-N for a particular flow can reduce head-of-line blocking.
Packets within an ordering domain can be processed in FIFO order and packets of a given flow can be processed in FIFO order. In some examples, packets in different ordering domains or flows can bypass one another so that packets in a first ordering domain or flow can bypass packets in a second, different ordering domain or flow. Allowing packets of different flows to bypass one another can reduce an amount of head of line blocking caused by packets of different flows. If there are more flows than queues, a hash can be used to assign packets of a flow to a queue or load balance queues.
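The ordering behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the `OrderingDomains` class and its methods are hypothetical, a modulo operation stands in for the hash that maps a flow to a queue, and a set of cached flow IDs stands in for the cache manager.

```python
from collections import deque


class OrderingDomains:
    """FIFO per ordering domain; different domains can bypass one another."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, flow_id, packet):
        # Modulo stands in for a hash of the flow ID (assumption).
        self.queues[flow_id % len(self.queues)].append((flow_id, packet))

    def dequeue_ready(self, cached_flows):
        """Release at most one head packet per queue if its flow state is cached."""
        ready = []
        for q in self.queues:
            if q and q[0][0] in cached_flows:
                ready.append(q.popleft()[1])
        return ready


od = OrderingDomains(4)
od.enqueue(1, "f1-p0")   # flow 1 -> queue 1
od.enqueue(1, "f1-p1")
od.enqueue(2, "f2-p0")   # flow 2 -> queue 2
od.enqueue(5, "f5-p0")   # flow 5 -> queue 1, behind flow 1

# Flow 1 misses in the cache: queue 1 stalls, but flow 2 bypasses it.
first = od.dequeue_ready(cached_flows={2, 5})
# Once flow 1 state is cached, its head packet is released in FIFO order.
second = od.dequeue_ready(cached_flows={1, 2, 5})
```

In this run, the packet of flow 5 is held behind flow 1 because both flows share a queue, which illustrates why mapping flows to separate ordering domains can reduce head-of-line blocking.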
One or more of PPUs 104-0 to 104-N can include one or more read-modify-write circuitry. Read-modify-write circuitry can perform programmable read-modify-write operations on a set of state variables. For example, read circuitry RD0 and RD1 can read state data for a packet from a cache or memory allocated by cache manager 110. Stateful ALU circuitry (ALU) can modify and update state variables, packet header, and metadata fields. An ALU can perform multiple cycles of computation. The read-modify-write circuitry (e.g., RD0, RD1, and ALU) can include two sequential read stages and a stateful ALU module, although other numbers of sequential read stages and stateful ALU modules can be included in a read-modify-write (RMW) circuitry.
RMW on global state data at high rates is challenging because the pipelined read, modify, and write operations must finish updating the state before the next packet that uses the state is processed. In some examples, PPUs 104-0 to 104-N can process one packet per cycle and RMW operations on global state can be completed in a single clock cycle. In some cases, a flow has a performance target (e.g., packets processed per second) of processing one packet per y clock cycles. Some examples of PPUs 104-0 to 104-N can perform RMW on flow state for packets of a same flow so that y cycles of pipelined operations (over multiple stages) are permitted to finish the RMW. In some cases, ordering infrastructure (one or more of queues 112-0 to 112-N) can be used to stall another packet of a flow to allow multiple cycles to finish RMW for state processing of a packet of the flow.
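As a non-limiting illustration of the per-flow spacing described above, the following Python sketch computes the cycle at which each packet could enter the RMW stages when a multi-cycle RMW on flow state must finish before the next packet of the same flow begins. The function name `issue_cycles` and the strictly in-order issue model are assumptions made for illustration and are not taken from the described hardware.

```python
def issue_cycles(flow_ids, rmw_cycles):
    """Return the start cycle of each packet, issuing at most one packet per cycle.

    A packet of a flow waits until the previous RMW on that flow's state has
    finished; packets of other flows are pipelined one cycle apart.
    """
    busy_until = {}   # flow ID -> first cycle at which its state is free again
    next_cycle = 0
    starts = []
    for flow in flow_ids:
        start = max(next_cycle, busy_until.get(flow, 0))
        starts.append(start)
        busy_until[flow] = start + rmw_cycles
        next_cycle = start + 1
    return starts


# With a 4-cycle RMW, back-to-back packets of flow "A" are spaced 4 cycles apart,
# while the packet of flow "B" issues in the very next cycle.
starts = issue_cycles(["A", "B", "A", "A"], rmw_cycles=4)
```

Setting `rmw_cycles=1` models the single-cycle case used for global state, where consecutive packets never stall.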
Cache manager 110 can manage a pool of one or more caches (e.g., static random access memory (SRAM) caches). One or more cache devices can store flow state read from memory (e.g., dynamic random access memory (DRAM)). A cache can include one read port and one write port to a PPU stage (e.g., one or more of PPU 104-0 to 104-N). The read and write ports for a cache can be assigned to a single PPU at packet processing pipeline program (e.g., Protocol-independent Packet Processors (P4) or others) compilation time. In other words, read-modify-write operations on a given memory address can be performed within a single PPU and not be split across PPUs.
A pool of one or more SRAM and content-addressable memory (CAM) resources can be assigned to one or more PPUs at compilation of a pipeline program (e.g., P4 or others). A write back cache can allow scaling available memory beyond on-chip memory. A CAM resource pool can be used to implement exact-match action tables in some examples to be used to look up connection state or metadata. CAM resources can implement a read and write interface, which can be statically assigned to a PPU at pipeline program compile time. Contents of CAM resource pool can be modified by insertions or deletions.
Free list manager 106 can maintain free lists which can be used to implement resource allocation. For example, free lists can be used to implement dynamic memory allocation for linked list data structures, or to allocate unique packet identifiers. Push and pop interfaces for a free list can be statically assigned to a PPU at pipeline program compilation time. In some examples, free list manager 106 can provide one or more free list addresses per packet and one or more free list addresses can correspond to an address in cache or memory to store read but subsequently modified data such as modified state data. Free list manager 106 can perform pop or push of entries for free lists in cache. Free list manager 106 can be used for dynamic memory allocation.
Classification 102, PPUs 104-0 to 104-N, free list manager 106, CAM resource pool 108, and/or cache manager 110 can be programmed with a pipeline program consistent with one or more of: P4, Software for Open Networking in the Cloud (SONiC), Broadcom® Network Programming Language (NPL), NVIDIA® CUDA®, NVIDIA® DOCA™, Data Plane Development Kit (DPDK), OpenDataPlane (ODP), Infrastructure Programmer Development Kit (IPDK), eBPF, x86 compatible executable binaries or other executable binaries, among others.
A stateful ALU pipeline can include X compute stages (where X is an integer) (e.g., CMP0, 1, 2, 3) to allow a developer to implement (up to) Y-instruction (where Y is an integer) read-modify-write operations on flow state. In order to provide atomicity of stateful operations, a single packet from a given flow can be processed by compute stages at a time. X compute stages can be used to implement X-cycle RMW operations on connection state. The architecture can limit processing of a single packet from a given connection by these compute stages at a time. Atomic processing can refer to an event receiving side effects (e.g., state changes, metadata changes) caused by previous events.
A compute stage (e.g., one or more of CMP0, 1, 2, 3) can include compute ALUs, comparison ALUs, and programmable logic for Boolean algebra. The compute ALUs can perform simple arithmetic or bitwise operations. The comparison ALUs can perform comparison operations and produce a Boolean value to indicate the result. The programmable logic for Boolean algebra can use a programmable logic array (PLA) to compute new predicates and in turn tell the crossbar how to update the operands for the next stage. Boolean algebra (Alg) can perform Boolean arithmetic on outputs from compute stages.
ALUs (e.g., one or more of ALU0, 1, 2, or 3) can support instructions for packet sequence number (PSN) arithmetic and bitmap operations that take into account that PSN values can wrap around. ALU operations based on instructions for transport protocols can include: bitmap operations, Boolean operations, add, subtract, first set bit, and others.
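The wrap-around behavior of PSN arithmetic can be sketched as follows. The 24-bit PSN width is an assumption (it matches the PSN field of InfiniBand-style transports), and the helper names `psn_add`, `psn_diff`, `psn_before`, and `first_set_bit` are hypothetical; the description above does not define the instruction set at this level of detail.

```python
PSN_BITS = 24                       # assumed PSN width
PSN_MASK = (1 << PSN_BITS) - 1
PSN_HALF = 1 << (PSN_BITS - 1)


def psn_add(psn, n):
    """Add n to a PSN, wrapping modulo 2**PSN_BITS."""
    return (psn + n) & PSN_MASK


def psn_diff(a, b):
    """Signed distance from b to a, in the range [-PSN_HALF, PSN_HALF)."""
    d = (a - b) & PSN_MASK
    return d - (1 << PSN_BITS) if d >= PSN_HALF else d


def psn_before(a, b):
    """True if PSN a precedes PSN b, taking wrap-around into account."""
    return psn_diff(a, b) < 0


def first_set_bit(bitmap):
    """Index of the lowest set bit of bitmap, or -1 if no bit is set."""
    return (bitmap & -bitmap).bit_length() - 1
```

For example, `psn_before(PSN_MASK, 0)` is true: the largest PSN value immediately precedes zero once the counter wraps, even though it is numerically larger.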
A bypass path (e.g., Stage N−1 data and Stage 0 Data) can be used to support single cycle RMW operations, which is used to implement updates on global state that is shared across connections. Bypass paths support single cycle and N-cycle read-modify-write operations. Single cycle read-modify-write operations can be used to update global state that is shared across flows. Bypass paths can be added to implement stateful operations with different performance requirements. For example, a bypass line can permit a single clock cycle operation on global state so that read-modify-write occur atomically on multiple packets.
Outputs from a stateful ALU can include updated metadata and unclaimed (or freed) free list addresses. In addition to returning unclaimed (or freed) free list addresses (e.g., Freelist Addr 0-3), the stateful ALU can output two (or another number of) RAM write commands (RAM Word 0 and 1 and RAM Addr 0 and 1), which can be performed in parallel as long as they target different memories, to push new entries onto a linked list. Freelist addresses 0 and 1 can be pre-allocated to a packet and Freelist addresses 2 and 3 can be freed for the packet. The use of two is merely an example and other numbers can be used.
General or global state can represent state shared between multiple different flows. In some examples, a PPU can execute single-instruction read-modify-write operations on general or global state, such as incrementing or decrementing global counter statistics to count outstanding packets. One or more PPUs of a programmable pipeline can execute multi-instruction or multiple ALU operations over multiple clock cycles to perform read-modify-write operations on flow state data, such as general or global state.
For example, a programmable pipeline can perform transport protocol logic for a flow. Multi-instructions or multiple ALU operations over multiple clock cycles can be performed on connection state. In some cases, performance goals for a single connection processing speed are less than line rate.
The code segment below shows an example of RMW operation on connection state in a sequence to update a receiver's sliding window as packets arrive over the network. A sliding window can represent a window of packets that a receiver is currently able to process. For example, arriving packets whose packet sequence number (PSN) falls before the window have already been received and hence are duplicates and packets whose PSN falls beyond the window have arrived too far out of order for the receiver to handle.
In this example, 5 ALU operations can be performed by one or more PPUs to modify connection state. Connection state can represent protocol-specific state variables used by the connection to implement tasks such as reliable delivery, congestion control, resource management, etc. A stateful operation can be implemented atomically between packets of the same connection.
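As a non-limiting sketch, and not a reproduction of any particular code segment, a receiver sliding-window update of the kind described above could be modeled in Python as follows. The 24-bit PSN width, the 32-entry window, the state layout (`base` and `bitmap`), and the grouping into five operations are all assumptions made for illustration.

```python
PSN_MASK = (1 << 24) - 1   # assumed 24-bit PSN
WINDOW = 32                # assumed window size in packets


def rx_window_update(state, psn):
    """Update receiver window state for an arriving PSN.

    state["base"] is the lowest PSN not yet received; bit i of state["bitmap"]
    records receipt of PSN base + i. Returns a classification string.
    """
    offset = (psn - state["base"]) & PSN_MASK            # op 1: wrap-safe offset
    if offset > PSN_MASK // 2:                           # op 2: compare to window
        return "duplicate"                               # PSN falls before window
    if offset >= WINDOW:
        return "out_of_window"                           # too far out of order
    if state["bitmap"] & (1 << offset):
        return "duplicate"                               # already received
    bitmap = state["bitmap"] | (1 << offset)             # op 3: mark as received
    # op 4: count trailing ones by locating the lowest zero bit
    advance = (~bitmap & (bitmap + 1)).bit_length() - 1
    state["bitmap"] = bitmap >> advance                  # op 5: slide the window
    state["base"] = (state["base"] + advance) & PSN_MASK
    return "accepted"


state = {"base": 0xFFFFFE, "bitmap": 0}
results = [
    rx_window_update(state, 0xFFFFFF),  # out of order, held in the bitmap
    rx_window_update(state, 0xFFFFFE),  # fills the gap; window slides past the wrap
    rx_window_update(state, 0xFFFFFE),  # now before the window
    rx_window_update(state, 100),       # beyond the window
]
```

All reads and writes of `state` in one call correspond to a single read-modify-write that would need to be atomic with respect to other packets of the same connection.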
In some examples, linked lists can be used to implement per-flow queues. For some linked lists, push and pop operations do not involve multiple reads from or writes to the same memory and an empty linked list may not be allocated to a node. A node can represent a memory address. Nodes used to build the linked list can be dynamically allocated as needed, and a free list can be used to pass available memory handles.
Note that the linked list head (LL_head) and tail pointers (LL_tail) and the actual nodes can be stored in separate memories and thus these writes can occur in parallel.
To pop an entry from the linked list, two memory reads to two different memories can be performed: read head pointer and read node that head pointer points-to. Two sequentially dependent memory reads can be performed when popping a head entry off the linked list: fetch the head pointer (LL_head) and then fetch the node that the head pointer points to in order to move the head pointer forward. The head pointer can be updated using the result of the second read operation. Note that these two read operations can be pipelined because they are issued to separate memories.
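The push and pop access patterns can be sketched as follows. This is one possible layout under stated assumptions, not a description of the actual hardware: the `FreeList` and `FlowQueue` classes are hypothetical, the head and tail pointers are modeled as one memory and the nodes as another, and an empty list always holds one pre-allocated placeholder node so that a push never writes the node memory twice.

```python
class FreeList:
    """Pool of unused node addresses (hypothetical helper)."""

    def __init__(self, size):
        self.addrs = list(range(size))

    def pop(self):
        return self.addrs.pop()

    def push(self, addr):
        self.addrs.append(addr)


class FlowQueue:
    """Per-flow linked list with pointers and nodes in separate memories."""

    def __init__(self, node_mem, free_list):
        self.node_mem = node_mem          # node memory: addr -> [value, next_addr]
        self.free_list = free_list
        placeholder = free_list.pop()     # empty list still owns one node
        self.ll_head = placeholder        # pointer memory
        self.ll_tail = placeholder

    def push(self, value):
        new_tail = self.free_list.pop()              # pre-allocated handle
        old_tail = self.ll_tail                      # read pointer memory
        self.node_mem[old_tail] = [value, new_tail]  # write node memory
        self.ll_tail = new_tail                      # write pointer memory

    def pop(self):
        head = self.ll_head                     # first read: pointer memory
        if head == self.ll_tail:                # tail is read with the head
            return None                         # list is empty
        value, next_addr = self.node_mem[head]  # second read depends on first
        self.ll_head = next_addr                # update the first address
        self.free_list.push(head)               # return the node to the free list
        return value


node_mem = [None] * 8
free_list = FreeList(8)
queue = FlowQueue(node_mem, free_list)
queue.push("a")
queue.push("b")
popped = [queue.pop(), queue.pop(), queue.pop()]
```

In this layout, the two writes of a push target different memories, and the two reads of a pop are sequentially dependent but also target different memories, which matches the pattern of two dependent reads followed by an update to the first address.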
In some examples, as shown, three memory accesses can be performed for push and pop operations on the linked list. However, more linked list operations can be supported than push and pop. For example, developers can write pipeline programs that can push to either the front or back of a linked list, or implement a go-back-N queue, such as used for transport protocol implementations such as remote direct memory access (RDMA) over Converged Ethernet (RoCE).
One or more instances of a stateful programmable pipeline 402 can be used. For example, an instance of a stateful programmable pipeline 402 can process packets on transmit and receive and another instance of a stateful programmable pipeline 402 can process queueing related events. Pipelines can process one event per cycle (e.g., 1 billion events/sec). Programmable queue management pipeline 404 can manage transmit (TX) or receive (RX) queues and enforce a programmable congestion control policy.
Programmable queue management 404 can be implemented using programmable primitives similar to those utilized for programmable packet processing pipeline 402. Programmable queue management 404 can utilize primitives for implementing scheduling decisions amongst queues, as well as primitives for implementing the memory access pattern and memory allocation required for linked lists. A programmer can use these primitives to configure utilization of a queue data structure and decide how to enable/disable queues for scheduling.
Programmable queue management 404 can manage a connection's transmit and receive queues and enforce a congestion control policy by marking queues as either active or inactive. Queue management 404 can process queueing events such as packets to enqueue, scheduling events, or congestion control state update events.
Protocol state can be cached in on-chip static random access memory (SRAM) or other memory and backed by Double Data Rate (DDR) memory 406 or other memory. Protocol state can be used for implementing reliable packet delivery, congestion control, telemetry, etc.
Configurable scheduling 408 can schedule packets for transmission from active queues and can generate scheduling events to be processed by programmable pipeline 402 to perform a configurable scheduling policy to arbitrate across queues that have been marked as active by programmable queue management 404. Scheduling 408 can generate scheduling events that indicate the selected connection and queue identifier (ID). Programmable queue management 404 can process the scheduling event and fetch the packet state from the corresponding connection and queue ID. Scheduling 408 can implement a configurable, hierarchical scheduling policy to schedule packet transmissions from amongst the active queues.
Scheduling 408 can schedule packet transmission from among the active queues and generate scheduling events for the programmable queue management. Upon processing a scheduling event, programmable pipeline 402 can determine if a packet is to be transmitted from the indicated queue. If so, programmable pipeline 402 can read a packet descriptor from the indicated queue and cause transmission of the corresponding packet from packet buffer 412. Packets transmitted from packet buffer 412 can be processed by programmable pipeline 402 again before transmission to the network. Depending on the protocol logic, the packet may remain buffered, and the packet descriptor may remain in the transmit queue in order to facilitate retransmissions if needed. Upon being successfully acknowledged, the packet and descriptor can be freed for reuse.
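As a simplified illustration of arbitration across active queues, the following Python sketch generates scheduling events that carry a connection ID and queue ID. Plain round-robin is assumed here only for brevity, whereas the description above contemplates a configurable, hierarchical policy; the `RoundRobinScheduler` class and the event field names are hypothetical.

```python
from collections import deque


class RoundRobinScheduler:
    """Arbitrates among queues that queue management has marked as active."""

    def __init__(self):
        self.active = deque()   # (conn_id, queue_id) pairs in service order

    def mark_active(self, conn_id, queue_id):
        key = (conn_id, queue_id)
        if key not in self.active:
            self.active.append(key)

    def mark_inactive(self, conn_id, queue_id):
        key = (conn_id, queue_id)
        if key in self.active:
            self.active.remove(key)

    def next_event(self):
        """Return a scheduling event for the next active queue, or None."""
        if not self.active:
            return None
        conn_id, queue_id = self.active[0]
        self.active.rotate(-1)   # move the served queue to the back
        return {"type": "schedule", "conn_id": conn_id, "queue_id": queue_id}


sched = RoundRobinScheduler()
sched.mark_active(1, 0)
sched.mark_active(2, 0)
picked = [sched.next_event()["conn_id"] for _ in range(3)]
sched.mark_inactive(1, 0)   # e.g., congestion control pauses connection 1
after = sched.next_event()["conn_id"]
```

Marking a queue inactive removes it from arbitration without touching its contents, so buffered descriptors remain available for later transmission or retransmission.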
General purpose embedded processor cores 410 can be configured to process low event rate processing, such as connection management and processing congestion signals.
Packet buffer 412 can store packet header, data, and metadata as well as scheduling timer events. For reliable transport, packet buffer 412 can store packet data until the packet data is successfully delivered to a remote endpoint. Packet buffer 412 can store packets to be re-transmitted in the event of an indication that a packet was not received (e.g., negative acknowledgement (NACK) or no receipt of an ACK within a timed interval). Timer events processed by the programmable pipeline can be used to implement tasks such as generating packet retransmissions, performing ACK coalescing, and generating probe packets.
When a protocol engine (e.g., RDMA PE 502) generates a packet to be transmitted on a given connection, the packet can be processed by the programmable packet processing pipeline (e.g., PTA 504). PTA 504 can perform operations such as allocating buffer resources for the packet, assigning a packet sequence number, and other protocol-specific operations. The packet can be buffered and, in parallel, processed by programmable queue management, as described herein. Programmable queue management can insert a packet descriptor into the appropriate transmit queue and, if the congestion control policy allows it, mark the queue as active. Transmit queues can be implemented as linked lists in cacheable memory.
RDMA protocol engine (PE) 502 can implement the InfiniBand Verbs application interface, and programmable transport architecture (PTA) 504 can provide reliability and congestion control for the packet generated by an RDMA PE 502. PTA 504 can provide sufficient programmability to support various data center transport protocols. Examples of transport protocols include at least: remote direct memory access (RDMA) over Converged Ethernet (RoCE), RoCEv2, Amazon's scalable reliable datagram (SRD), Amazon AWS Elastic Fabric Adapter (EFA), Microsoft Azure Distributed Universal Access (DUA) and Lightweight Transport Layer (LTL), Google GCP Snap Microkernel Pony Express, High Precision Congestion Control (HPCC) (e.g., Li et al., “HPCC: High Precision Congestion Control” SIGCOMM (2019)), improved RoCE NIC (IRN) (e.g., Mittal et al., “Revisiting network support for RDMA,” SIGCOMM 2018), Homa (e.g., Montazeri et al.,“Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities,” SIGCOMM 2018), NDP (e.g., Handley et al., “Re-architecting Datacenter Networks and Stacks for Low Latency and High Performance,” SIGCOMM 2017), and/or EQDS (e.g., Olteanu et al., “An edge-queued datagram service for all datacenter traffic,” USENIX 2022).
Non-limiting examples of PTA 504 are described with respect to
For packets to be transmitted from PTA 504, packet processor 506 can perform additional packet processing such as encapsulation or decapsulation for network virtualization, traffic shaper 508 can pace transmission rate of packets into a network, packet builder 510 can fetch packet data from host memory to build outgoing packets, encryption/decryption 512 can perform encryption of packets prior to transmission to a network using network interfaces 514.
For packets received from a network by network interfaces 514, encryption/decryption 512 can perform decryption of packets and body segment storage (BSS) 516 can store packets prior to processing by PTA 504.
A developer can program PTA by defining an event graph. An event graph can represent stateful data plane operations as a data flow graph in which nodes perform event processing and edges indicate how events flow between the nodes. For example, an event graph can represent operations of a transport protocol. Multiple event graphs may be compiled and loaded onto PTA simultaneously in order for PTA to run multiple transport protocols at the same time.
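An event graph of this kind can be sketched in Python as a set of handlers (nodes) that each return zero or more new events addressed to other nodes (edges). The `run_graph` function, the handler names, and the event fields are hypothetical and are used only to illustrate the data flow model; they do not represent an actual programming interface of PTA.

```python
from collections import deque


def run_graph(handlers, first_node, first_event):
    """Deliver an event to a node and follow the events each node produces."""
    state = {}    # shared here for brevity; per-node state is also possible
    trace = []    # order in which nodes handled events
    pending = deque([(first_node, first_event)])
    while pending:
        node, event = pending.popleft()
        trace.append(node)
        # Each handler returns a list of (next_node, new_event) pairs.
        pending.extend(handlers[node](event, state))
    return state, trace


def on_rx_packet(event, state):
    """Node: count packets per connection, then request an ACK."""
    count = state.get(event["conn_id"], 0) + 1
    state[event["conn_id"]] = count
    return [("gen_ack", {"conn_id": event["conn_id"], "acked": count})]


def on_gen_ack(event, state):
    """Node: terminal node that produces no further events."""
    return []


handlers = {"rx_packet": on_rx_packet, "gen_ack": on_gen_ack}
state, trace = run_graph(handlers, "rx_packet", {"conn_id": 9})
```

Running several protocols at once would correspond to loading several independent handler tables, each describing its own graph.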
RDMA PE can provide inputs to a multiplexer of a metadata and associated packet to be transmitted (e.g., ULP2PTA Pkt) as well as acknowledge (or negatively acknowledge) successful processing of packets that PTA delivered (e.g., ULP2PTA ACK) and received packet and associated metadata that an ingress pipeline delivers to PTA (e.g., Net2PTA Pkt).
In some examples, PTA includes a pipeline of one or more programmable Event Processing Units (EPUs) 550-0 to 550-A. EPUs can be organized as a pipeline such that events produced by one EPU flow to the subsequent EPU in the pipeline. EPUs 550-0 to 550-A can include programmable event processing engines that perform memory accesses, while enforcing atomicity when required. EPUs 550-0 to 550-A can include hardware to perform atomic memory accesses. EPUs 550-0 to 550-A can process data-plane events according to a user-specified event processing program. An EPU can process events (e.g., a collection of metadata) and may produce zero or more new events. The user-specified packet processing program can specify operations of a transport protocol. In some designs, an EPU can be statically assigned memory and compute resources from ALU core pool, SRAM pool, and/or CAM pool at program compilation time.
An EPU can include programmable and reconfigurable circuitry, used to implement a user-defined node in an event graph, such as by a CSP or tenant. An EPU can receive an incoming event, retrieve memory entries corresponding to that event, and dispatch event and memory data to a programmable compute engine. The programmable compute engine can execute a program to modify the event and memory data. The programmable compute engine can update event and memory entries before the event is passed to the next node. In some examples, compute circuitry (e.g., circuitry to update event metadata and/or memory data) can be included within an EPU or disaggregated in a global pool of compute resources. A programmable compute engine may be integrated into an EPU or may be located in a disaggregated resource pool that is shared across multiple EPUs. The programmable compute engine may be implemented as a pipeline of configurable ALUs, as shown in
In some examples, an EPU can process up to one event per clock cycle, or other numbers of events per clock cycle. An EPU can simultaneously bypass one or more events per clock cycle that are to be passed through the EPU (to forward one or more events to one or more different EPUs). PTA may leverage an event switch to route events between one or more event processors.
Memory and compute pools available to EPUs 550-0 to 550-A can include: SRAM pool 554 and CAM pool 556 for exact-match table lookups of connection contexts, and ALU core pool 558 to perform data plane event processing. One or more processors in ALU core pool 558 can be allocated to each EPU to perform data plane event processing. An ALU core can execute VLIW instructions for bitmap operations to evaluate Boolean expressions, and other compute tasks.
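The parallel semantics of a VLIW instruction operating on event metadata can be illustrated as follows. In this sketch, every operation in a bundle reads the field values as they were before the bundle executed, so operations within the same bundle do not observe one another's results. The `execute_bundle` function, the tuple encoding of operations, and the field names are assumptions made for illustration.

```python
import operator


def execute_bundle(fields, bundle):
    """Apply all operations of one bundle to a snapshot of the metadata fields.

    Each operation is (dest, op, src_a, src_b). Sources are read from the
    pre-bundle snapshot, which models the operations running in parallel.
    """
    snapshot = dict(fields)
    result = dict(fields)
    for dest, op, src_a, src_b in bundle:
        result[dest] = op(snapshot[src_a], snapshot[src_b])
    return result


meta = {"psn": 10, "base": 8, "len": 100, "bytes": 900}
bundle = [
    ("offset", operator.sub, "psn", "base"),    # compute ALU
    ("bytes", operator.add, "bytes", "len"),    # compute ALU
    ("in_order", operator.eq, "psn", "base"),   # comparison ALU
]
out = execute_bundle(meta, bundle)
```

A multi-cycle RMW would then correspond to a sequence of such bundles, with each bundle consuming the fields produced by the previous one.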
SRAM pool 554 can include a pool of SRAM or other memory resources that are statically assigned to EPUs at program compilation time based on memory requirements defined in the program. In order to avoid synchronization issues, an SRAM can be assigned to at most one EPU (e.g., multiple EPUs may be prevented from accessing the same SRAM simultaneously). SRAMs can store protocol state such as per-connection state for cached connections.
CAM pool 556 can include CAM resources that can be statically assigned to EPUs at program compilation time based on memory requirements defined in the program. In order to avoid synchronization issues, a CAM can be assigned to at most one EPU (e.g., multiple EPUs may be prevented from accessing the same CAM simultaneously). CAMs can be used to implement exact match tables to map a unique connection ID to a connection cache index.
ALU core pool 558 can include a pool of VLIW processors to process event and memory data with very low latency (e.g., a dozen clock cycles). A core can be statically assigned to an EPU when the event graph program(s) are compiled and loaded onto PTA. Cores can be assigned to one or more EPUs based on the compute requirements of the event graph program(s).
DRAM interface 552 can process events related to caching and can fetch and evict protocol state between local SRAM (e.g., SRAM pool 554) and off-chip DRAM. DRAM interface 552 can implement protocol state caching. For example, DRAM interface 552 can process events to evict the necessary connection state from SRAM into DRAM, as well as to load connection state from DRAM into SRAM. DRAM interface 552 can generate a cache fill event to be processed by the EPUs once a cache load is complete.
Tx scheduler 560 may schedule packet transmission amongst the system's transmit queues. Rx scheduler 562 may schedule delivery of packets to the upper layer processor (ULP) from the system's receive queues (or reorder queues). ULP can provide an interface to an application (e.g., virtual machine (VM), container, process, microservice, and so forth) running on a host server. For example, the RDMA ULP can implement an InfiniBand (IB) Verbs interface to applications. Other ULPs can implement other application interfaces (e.g., sockets, Message Passing Interface (MPI) send/receive, Remote Procedure Call (RPC) request/response, etc.)
Miss queue (Q) scheduler 564 can make configurable hierarchical scheduling decisions for packets that experienced connection cache misses. Miss Queue Scheduler 564 may schedule packets from miss queues, which store packets that experienced a connection cache miss until the connection state is loaded from DRAM. Schedulers can maintain an amount of state to track which queues (e.g., TX queues, RX queues, and miss queues (not shown)) are eligible for scheduling. TX scheduler 560, RX scheduler 562, and miss queue scheduler 564 can implement a configurable, hierarchical scheduling policy to generate scheduling events for associated queues. Scheduling eligibility of various queues may be updated upon processing events that are generated by the EPUs.
Timer event scheduler 566 can include a configurable scheduler used to schedule events based on time. Timer event scheduler 566 can be configured to generate an event periodically for cached connections that have the event enabled. For example, timer event scheduler 566 can initiate timeout-based packet retransmissions, implement ACK coalescing, or perform other time-based tasks. In some examples, timer event scheduler 566 can support multiple timer event types (per cached connection).
Work conserving scheduler 567 can arbitrate among events arising from packet transmit (TX) queues, packet receive (RX) queues, or miss queues. Work conserving scheduler 567 can select events among multiple different classes of events based on a configured scheduling policy (e.g., weighted round robin, round robin, strict priority, or others). Work conserving scheduler 567 can schedule events in a work conserving manner to attempt to keep EPUs busy.
CPU interface 568 can implement a shared memory queue interface with software running on one or more general purpose processors or embedded cores. Software running on the embedded cores can implement a congestion control algorithm such as Swift, HPCC, or algorithm defined by the CSP or tenant. Software can produce response events which may indicate updated congestion control parameters (e.g., congestion window, transmission rate, etc.). The embedded cores may also run control plane software to handle connection setup, exception processing, etc. For example, a control plane executed on a network interface device and/or host server can manage the data plane running in PTA to cause connection setup, handle runtime errors, etc.
Packet buffering, parsing, and editing 570 can store packet data and metadata until it is no longer needed, for instance, until the remote host ACKs the packet and it no longer needs to be retransmitted. For example, a packet can be stored until it is explicitly freed by an EPU-generated event (e.g., after the packet has been successfully delivered to the remote host or local ULP).
PTA system can provide outputs of: packet and metadata that PTA delivers to an egress pipeline for transmission (e.g., PTA2Net Packet); packet and metadata that PTA delivers to the ULP (e.g., PTA2ULP Packet); completion messages to the ULP upon successful (or unsuccessful) delivery of packets to the remote host (e.g., ULP Completion); and return flow control credit to the ULP (e.g., ULP Credit Return). Outputs from PTA system can be provided to an egress pipeline or ULP.
For example, event processing nodes can include one or more of the following. Egress Pipe Input can produce a TX Packet Event for an outbound packet being transmitted by the network interface device and initialize event metadata fields upon event generation (e.g., connection ID, packet sequence number). Egress Pipe Output can consume a TX Packet Event which includes control metadata fields that can be set by user-defined nodes in the event graph to affect subsequent processing of the packet in the NIC's egress pipeline.
Ingress Pipe Input can produce an RX Packet Event for each inbound packet received by the NIC over the network and initialize event metadata fields upon event generation. Ingress Pipe Output can process an RX Packet Event which includes control metadata fields that can be set by user-defined nodes in the event graph to affect subsequent processing of the packet in the NIC's ingress pipeline.
For example, functionality can include tracking an average RTT for each connection (conn.avg_rtt). Acknowledgement packets contain timestamp values that can be used to compute an instantaneous RTT measurement for a connection, which in turn can be used to update an exponentially weighted moving average RTT for the connection. Functionality can also include tracking the number of retransmitted packets for each connection over a recent window of time (conn.retx_count). TX Packet Event metadata indicates whether the outbound packet is a retransmission and a current clock time. The number of packet retransmissions can be counted for each connection within a configurable window of time. The count can be reset when moving to a new time window. The total number of outstanding packets across connections at the host (total_outstanding_pkt_count) can be tracked.
A global state variable that is shared across connections (total_outstanding_pkt_count) can be tracked. The total_outstanding_pkt_count can be incremented for each new (non-retransmission) TX packet or decremented when processing ACKs from the network. For example, the following pseudocode can be applied to detect congestion and potentially change a network path for packets. Operations can be split across multiple user-defined nodes.
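As a minimal sketch of the state tracking described above, the following Python models the per-connection RTT average, the windowed retransmission count, and the shared outstanding-packet counter. The smoothing weight, the window length, and the class and method names (CongestionStats, on_tx_packet, on_ack) are assumptions made for illustration and are not taken from the described design. The decision to change a network path is not modeled.

```python
from dataclasses import dataclass

EWMA_ALPHA = 0.125          # assumed smoothing weight (not given in the text)
RETX_WINDOW = 1_000_000     # assumed window length, in clock ticks


@dataclass
class ConnState:
    avg_rtt: float = 0.0    # conn.avg_rtt
    retx_count: int = 0     # conn.retx_count
    window_start: int = 0   # start time of the current counting window


class CongestionStats:
    def __init__(self):
        self.conns = {}                        # per-connection (flow) state
        self.total_outstanding_pkt_count = 0   # global state shared by all flows

    def _conn(self, conn_id):
        return self.conns.setdefault(conn_id, ConnState())

    def on_tx_packet(self, conn_id, now, is_retx):
        conn = self._conn(conn_id)
        # Reset the retransmission count when moving to a new time window.
        if now - conn.window_start >= RETX_WINDOW:
            conn.window_start = now
            conn.retx_count = 0
        if is_retx:
            conn.retx_count += 1
        else:
            # Only new (non-retransmission) packets add to the global count.
            self.total_outstanding_pkt_count += 1

    def on_ack(self, conn_id, now, echoed_tx_time, num_acked=1):
        conn = self._conn(conn_id)
        rtt = now - echoed_tx_time             # instantaneous RTT sample
        if conn.avg_rtt == 0.0:
            conn.avg_rtt = float(rtt)          # first sample seeds the average
        else:
            conn.avg_rtt = (1 - EWMA_ALPHA) * conn.avg_rtt + EWMA_ALPHA * rtt
        self.total_outstanding_pkt_count -= num_acked
```

In an event graph, on_tx_packet and on_ack could map to separate user-defined nodes, with the global counter held in a node that serializes all events rather than only events of one flow.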
Events in event queues may be scheduled (e.g., event queue scheduler 904) in round-robin, weighted round-robin, or other order (e.g., first-in-first-out). Groups of queues may be given higher weighting or priority in the scheduler, e.g., ODID ranges can be used to represent different protocols with different priority. Events to process can be chosen from those at the head of an input queue that do not have another event of the same ODID currently being processed in the same EPU and that are not marked for bypass. Events to bypass can be scheduled for processing in a similar manner, except that they are marked for bypass. An event can be marked for bypass if its Bypass Count (BC), set in the last processing node, is nonzero. A BC can be decremented after every bypass. There could be multiple bypass schedulers. A bypass scheduler can choose a bypass event per cycle and potentially process separate groups of queues.
Event queue scheduler 904 can schedule processing of events to enforce atomic state updates. For example, an EPU may wait to process an event belonging to an ODID until the previous event belonging to the same ODID is complete.
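The per-ODID ordering rule can be illustrated with a short Python sketch. It assumes integer ODIDs, a fixed number of input queues selected by ODID modulo the queue count, and plain round-robin among queues. The class and method names are hypothetical, and bypass handling is omitted.

```python
from collections import deque


class EventQueueScheduler:
    def __init__(self, num_queues=4):
        self.queues = [deque() for _ in range(num_queues)]
        self.in_flight = set()    # ODIDs with an event currently being processed
        self.next_queue = 0       # round-robin pointer

    def enqueue(self, odid, event):
        # Events of one ODID always land in the same queue, preserving order.
        self.queues[odid % len(self.queues)].append((odid, event))

    def schedule(self):
        """Return the next (odid, event) that is safe to process, or None."""
        n = len(self.queues)
        for i in range(n):
            q = self.queues[(self.next_queue + i) % n]
            # Only the head of a queue is eligible, and only if no other
            # event with the same ODID is still in flight.
            if q and q[0][0] not in self.in_flight:
                odid, event = q.popleft()
                self.in_flight.add(odid)
                self.next_queue = (self.next_queue + i + 1) % n
                return odid, event
        return None

    def complete(self, odid):
        # Completion indication: another event with this ODID may now run.
        self.in_flight.discard(odid)
```

A blocked head-of-queue event also blocks the events behind it in that queue, which is one reason events might be spread over several queues.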
Control 905 can store rules to configure other blocks within the EPU to process an event. Control 905 can include a CAM table that matches on the event type and other event metadata. Table entries can be configured at program compilation time and indicate event processing configuration information such as one or more of: Table ID to access (if any); for direct index tables, which event metadata field to use as the table index; for exact match tables, which event metadata field(s) to use as the table key; whether a second table access is required and if so, the table ID to access and which event metadata field or table 1 entry field to use as the index for table 2 (e.g., memory access to another table in a linked list to be used by lookup 906); starting program counter (PC) that the ALU core should use to process this event; which event metadata fields to pack into registers; which table entry fields to pack into registers; how to update table entry from final register state; and/or how to update event metadata from final register state.
Lookup 906 can fetch memory entries from memory pool. Some memory entries may be directly indexed by the ODID, or by another table index carried in the event. Some memory entries may be accessed via chained lookups whereby an index extracted from a looked-up entry may be used for a further lookup in a different table to access a data structure such as a linked list. Lookup 906 can support at least two chained lookup operations, such as, lookup to table A gives the index of table B to lookup. This feature can support a memory access pattern of linked lists. Lookup can support prefetching of table entries, such as reading ahead to the next entry in a linked list.
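A two-step chained lookup and a bounded linked list walk can be sketched in Python as follows. Tables are modeled as lists of dictionaries, and the field names next_idx and value are assumptions chosen for this illustration.

```python
def chained_lookup(table_a, table_b, index_a, link_field="next_idx"):
    """Look up table A, then use a field of that entry to index table B."""
    entry_a = table_a[index_a]
    entry_b = table_b[entry_a[link_field]]
    return entry_a, entry_b


def walk_list(nodes, head_idx, max_hops):
    """Traverse a linked list stored in a table, bounded by max_hops."""
    values = []
    idx = head_idx
    for _ in range(max_hops):
        if idx is None:          # end of list reached
            break
        node = nodes[idx]
        values.append(node["value"])
        idx = node["next_idx"]   # index that a prefetcher could read ahead
    return values
```

The hop bound reflects that the traversals are short and bounded, so the number of memory reads per event is known in advance.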
Register packing 908 can pack or load event metadata fields and table entry fields into the register slots that can be dispatched to an ALU core for processing. Register packing 908 can perform register packing using configuration information provided by control block. Register packing 908 can dispatch the packed registers and starting program counter to an ALU core based on instruction from ALU core scheduler.
ALU core scheduler 910 can determine how to dispatch events to ALU cores for processing. ALU core scheduler 910 can be configured with a set of cores that are assigned to the EPU at program compilation time. ALU core scheduler 910 can track status of whether one or more ALU cores are idle or busy. If a core is busy, ALU core scheduler 910 can track the ODID corresponding to the event that the core is processing. When a new event is ready for processing (e.g., after the registers have been packed), ALU core scheduler 910 can select an idle core and instruct register packing module to dispatch the event to the selected core. A core can indicate when event processing is complete and the core scheduler instructs the core when to dispatch its final register state to register unpacking module. ALU core scheduler can provide a completion indication back to event queue scheduler 904 indicating that another event can be scheduled with a same ODID.
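The idle/busy tracking performed by the core scheduler can be sketched as follows. This Python model assumes a first-idle selection policy, which the text does not specify, and the names are hypothetical.

```python
class AluCoreScheduler:
    def __init__(self, core_ids):
        # Maps each assigned core to the ODID it is processing (None if idle).
        self.busy = {core: None for core in core_ids}

    def dispatch(self, odid):
        """Pick an idle core for an event; return its ID or None if all busy."""
        for core, current in self.busy.items():
            if current is None:
                self.busy[core] = odid
                return core
        return None

    def complete(self, core):
        """Mark a core idle and return the ODID that can be rescheduled."""
        odid = self.busy[core]
        self.busy[core] = None
        return odid
```

The ODID returned by complete corresponds to the completion indication sent back to the event queue scheduler.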
One or more ALU cores in compute pool 912 can include a processor to complete calculations in an event graph node. An ALU core can include a partitionable ALU with VLIW dispatch that is capable of a wide (64 b) operation or multiple narrow (16/32 b) operations in a single cycle; support Boolean expressions (e.g., complex expressions on up to 8 input bits, which may be any 8 bits from any registers, calculable in a single cycle); perform bitmap handling (e.g., find-first-zero, set/clear of individual bits on wide bitmaps); perform single-cycle load and unload of threads (event nodes); and so forth.
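The bitmap operations named above can be expressed in software as follows. This Python sketch assumes bit 0 is the least significant bit and models the bitmap as an integer of a given width. It shows the semantics only, not the single-cycle hardware behavior.

```python
def find_first_zero(bitmap, width):
    """Return the index of the lowest clear bit, or -1 if all bits are set."""
    for i in range(width):
        if not (bitmap >> i) & 1:
            return i
    return -1


def set_bit(bitmap, i):
    return bitmap | (1 << i)


def clear_bit(bitmap, i):
    return bitmap & ~(1 << i)
```

A find-first-zero followed by set_bit is one way to claim the lowest free slot in a tracking bitmap, for example when recording which packets in a window have been received.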
Register unpacking circuitry 914 can use the final register state provided by the ALU core to: (1) update one or more table entries, (2) update event metadata, (3) update global, freelist, and policer states. After updating event metadata, register unpacking circuitry 914 can forward the event to the next EPU. Register unpacking circuitry 914 may update the event's ODID and/or bypass count before forwarding the event. Register unpacking circuitry 914 can also resubmit the event back into the current EPU's input event queues for additional processing if needed.
Read-Modify-Write memory bypass 916 can provide a write-through cache for table entries. Read-Modify-Write memory bypass 916 can store recently accessed table entries so that they can be accessed again with lower latency than would otherwise be if the table access reached the memory pool.
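The write-through behavior can be illustrated with a small Python model. The capacity and the least-recently-used replacement policy are assumptions for this sketch, and the backing dictionary stands in for the memory pool.

```python
from collections import OrderedDict


class WriteThroughCache:
    def __init__(self, backing, capacity=2):
        self.backing = backing          # stands in for the memory pool
        self.capacity = capacity
        self.entries = OrderedDict()    # recently accessed table entries
        self.hits = 0

    def _remember(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

    def read(self, key):
        if key in self.entries:
            self.hits += 1                     # served without a memory read
            self.entries.move_to_end(key)
            return self.entries[key]
        value = self.backing[key]
        self._remember(key, value)
        return value

    def write(self, key, value):
        self.backing[key] = value              # write-through to memory
        self._remember(key, value)
```

Because every write also reaches the backing memory, evicting a cached entry never loses state.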
Globals and freelists can store state that may need to be accessed and updated atomically between events (e.g., across ODIDs). Globals can support N state variables, which can be accessed and updated using a set of opcodes (e.g., increment or decrement). Freelists block can support N freelists, which are initialized at compile time. Freelists can be used to, for example, assign unique IDs to packets to maintain per-outstanding-packet state and/or dynamically allocate/deallocate data structure nodes (e.g., linked list nodes). Freelists can support a small set of opcodes to push and pop entries.
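A minimal Python sketch of the globals and freelist blocks follows. The opcode names ("inc", "dec") and the choice to return None from an empty freelist are assumptions for illustration.

```python
class Globals:
    def __init__(self, n):
        self.values = [0] * n           # N global state variables

    def apply(self, opcode, index, amount=1):
        """Apply an update opcode and return the resulting value."""
        if opcode == "inc":
            self.values[index] += amount
        elif opcode == "dec":
            self.values[index] -= amount
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        return self.values[index]


class Freelist:
    def __init__(self, size):
        # Initialized up front with every ID free; pop() hands out 0 first.
        self.free = list(range(size - 1, -1, -1))

    def pop(self):
        """Allocate an ID, or return None if the freelist is empty."""
        return self.free.pop() if self.free else None

    def push(self, entry):
        """Return an ID to the freelist."""
        self.free.append(entry)
```

For example, a node could pop an ID when a packet is transmitted to index per-packet state, and push it back when the packet is acknowledged.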
The following paragraphs describe how an EPU may be used to implement an example user-defined node in an event graph. An EPU can implement a user-defined node that performs two tasks: (1) assigns PSNs to outgoing request packets, and (2) keeps track of the total number of outstanding packets. This node will process 2 types of events: kUlpRequest—Corresponds to an outgoing request packet and kNetAck—Corresponds to an ACK packet received from the network. ACK packets cumulatively acknowledge packets up to the PSN indicated in the ACK packet.
Pseudocode for the event processing logic implemented by this node is shown below.
In the above example, conn_state_ is a table that maintains connection state and is indexed by an event metadata field called conn_cache_idx. It is assumed that a previous EPU computed the connection cache index (conn_cache_idx) for this event and recorded the value in the event metadata. An entry of the conn_state_ table can include two state variables: request_psn (e.g., indicates the PSN to assign to the next outgoing request packet), and oldest_outstanding_psn (e.g., tracks the oldest PSN that has not yet been acknowledged). Variable num_outstanding_pkts_ is a global state variable that is shared across connections and can indicate a total number of outstanding (e.g., transmitted but not yet acknowledged) request packets.
An example of operations of an EPU can be as follows. At (1), classifier 902 classifies an arriving event from another EPU or the current EPU into an event queue. A kUlpRequest event arrives at the EPU. In this case, the event is tagged with Ordering Domain Identifier (ODID)=connection cache index. Classifier 902 can assign the event to an input event queue based on a hash of the ODID.
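As a hedged illustration, the node logic can be sketched in Python using the state variables named above. This is a reconstruction from the prose and not the original pseudocode. It assumes that events are dictionaries, that the PSN carried in a kNetAck event is inclusive, and that PSN wraparound can be ignored. The class name PsnNode and the event field names are hypothetical.

```python
from dataclasses import dataclass

K_ULP_REQUEST = "kUlpRequest"
K_NET_ACK = "kNetAck"


@dataclass
class ConnEntry:
    request_psn: int = 0              # PSN for the next outgoing request
    oldest_outstanding_psn: int = 0   # oldest PSN not yet acknowledged


class PsnNode:
    def __init__(self, num_conns):
        self.conn_state_ = [ConnEntry() for _ in range(num_conns)]
        self.num_outstanding_pkts_ = 0    # global, shared across connections

    def process(self, event):
        entry = self.conn_state_[event["conn_cache_idx"]]
        if event["type"] == K_ULP_REQUEST:
            # Task 1: assign a PSN to the outgoing request packet.
            event["psn"] = entry.request_psn
            entry.request_psn += 1
            self.num_outstanding_pkts_ += 1
        elif event["type"] == K_NET_ACK:
            # Cumulative ACK (assumed inclusive): everything up to psn is acked.
            newly_acked = event["psn"] - entry.oldest_outstanding_psn + 1
            if newly_acked > 0:           # ignore duplicate or stale ACKs
                entry.oldest_outstanding_psn = event["psn"] + 1
                self.num_outstanding_pkts_ -= newly_acked
        # Task 2: expose the global count to downstream nodes.
        event["num_outstanding_pkts"] = self.num_outstanding_pkts_
        return event
```

In this sketch the per-connection entry only needs atomicity among events of the same connection, while num_outstanding_pkts_ is updated by every event and therefore corresponds to global state.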
At (2), event queue scheduler 904 schedules an event for processing. The event queue scheduler schedules the event for processing after ensuring that there are no other events with the same ODID currently being processed by the EPU.
At (3), control 905 can determine an EPU control configuration by event type and metadata. Control 905 can look up the rules for processing the event based on the event type. Control 905 can instruct lookup 906 to issue a read for table conn_state_ at index event→conn_cache_idx and inform register packing 908 how to pack the event metadata and table entry data into ALU core registers, as well as the starting program counter (PC) for the ALU core. Control 905 can instruct register unpacking 914 how to use the final ALU core register state to update the event metadata and table entry.
At (4), lookup 906 can perform lookup of table entry(s) for the event. Lookup 906 can issue a read to the conn_state_ table at the index identified by conn_cache_idx in the event metadata. Upon completing the read, lookup 906 can forward the table entry to the register packing module.
At (5), select event and memory/pack registers 908 can load table entry(s) and event metadata into registers for processing. For example, table entry(s) and event metadata can be loaded into 31 16-bit registers. Table entry(s) can include protocol state (e.g., connection context). Register packing 908 can pack part of the conn_state_ table entry (e.g., request_psn, which is 32 bits) into two 16-bit register slots.
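Packing a 32-bit field into two 16-bit register slots can be sketched in Python as follows. The low-half-first slot ordering, the zero fill of unused slots, and the function names are assumptions for illustration.

```python
def pack_u32(value):
    """Split a 32-bit value into two 16-bit slots (low half first, assumed)."""
    return [value & 0xFFFF, (value >> 16) & 0xFFFF]


def unpack_u32(slots):
    """Rebuild a 32-bit value from two 16-bit slots (low half first)."""
    return slots[0] | (slots[1] << 16)


def pack_registers(fields, num_slots=31):
    """Pack (value, width_bits) fields into a fixed set of 16-bit slots."""
    regs = []
    for value, width in fields:
        if width == 16:
            regs.append(value & 0xFFFF)
        elif width == 32:
            regs.extend(pack_u32(value))
        else:
            raise ValueError("only 16-bit and 32-bit fields are modeled")
    if len(regs) > num_slots:
        raise ValueError("fields do not fit in the register slots")
    return regs + [0] * (num_slots - len(regs))   # unused slots are zeroed
```

Register unpacking would apply the inverse mapping, for example calling unpack_u32 on the two slots that hold request_psn before writing the table entry back.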
At (6), ALU core scheduler 910 can select an ALU core to perform processing of the event. ALU core scheduler 910 can select an ALU core to dispatch the event to. Upon core selection, ALU core scheduler 910 can instruct the register packing module to dispatch the packed registers as well as starting PC to the selected core.
At (7), the selected ALU core can execute a routine and/or perform a fixed function operation to process the event. Examples of events are described herein and can be specified by a developer or CSP or CoSP administrator. The packed registers can be loaded into the register file of the selected ALU core which then executes the program indicated by the starting PC. In this example, the ALU core can execute a sequence of instructions that record the PSN to assign to the packet, increment the request_psn, load an opcode into the register file that defines how to update the num_outstanding_pkts_ global state, and set a control and status register (CSR) indicating that the program is complete.
At (8), register contents can be used to update event data and table entry(s). ALU core scheduler 910 can identify that the core has finished processing the event and instruct the core to dispatch its final register state to register unpacking 914. Register unpacking 914 can issue the write to update the conn_state_ table with the new request_psn value from the register state, issue the provided opcode to the globals module to increment the num_outstanding_pkts_ state, copy the packet PSN from the register state to the event metadata, copy the final value of the num_outstanding_pkts_ state into the event metadata, and forward the updated event metadata to the next EPU.
At (9), another event with a same ordering domain ID can be dispatched from the event queues for processing. In some examples, an atomicity guarantee can be achieved for accesses to protocol state. After the register unpacking module has issued the write to update the conn_state_ table, ALU core scheduler 910 can deliver a completion to the event queue scheduler, which enables it to schedule another event with the same ODID (e.g., another event that accesses the same conn_cache_idx in the conn_state_ table).
To attempt to make efficient use of memory bandwidth and compute resources, an EPU can decouple memory accesses and compute resources and use specialized hardware to schedule each separately. The EPU makes efficient use of memory bandwidth by scheduling only events that are not in danger of a read/write hazard. It is also optimized for memory access patterns that are common amongst stateful data plane applications, namely, simple table lookups and short, bounded linked list traversals. The EPU memory lookup engine can be configured to prefetch linked list nodes in order to enable high performance operations on the data structure.
ALU cores may not support instructions to load data from memory, which means they never need to stall waiting for a load to complete. The memory accesses associated with processing an event are performed before a thread is launched to process the event. This means the core can focus solely on issuing compute instructions to process an event while, at the same time, dedicated hardware issues memory accesses for other events.
In many stateful data plane applications, events belonging to a single flow (e.g., a single transport connection) need to access the same set of state variables. In order to maximize the rate at which events from a single flow can be processed, the EPU attempts to reduce the latency overhead of the read-modify-write loop. To do this, the EPU design may not allow tables to be shared across EPUs, which avoids the need to arbitrate for table access and makes the access latency more predictable, and may use a cache of recently accessed (or prefetched) table entries.
In order to support a large class of stateful data plane applications, compute operations that are used to update event data and memory data can be programmable. In order to enable this, the ALU cores use a set of simple RISC instructions that are not specific to a particular application. In addition, the EPU supports a set of instructions to manipulate global state that can be applicable across various data plane applications.
An EPU may not include its own local/dedicated compute and memory resources, but can utilize a pool of resources allocated based on the compute and memory parameters of the program being implemented. An EPU may not be provisioned with compute and memory resources required for a worst case node.
The following provides an example of PTA ALU core instruction set.
CSPs and CoSPs can deploy datacenter transport protocols that perform reliable (or unreliable) packet delivery over the network and congestion control. Table 1 provides an example description of various transport protocol aspects.
A transport protocol can be used to deliver data between applications over a network. A transport protocol to use in a data center depends on network properties such as one or more of: buffer sizes, bisection bandwidth, round trip time (RTT), in-network support for congestion control such as Explicit Congestion Notification (ECN), in-network telemetry (INT) (e.g., Internet Engineering Task Force (IETF) draft-kumar-ippm-ifa-01, “Inband Flow Analyzer” (February 2019)), packet trimming and priority queueing and workload properties (e.g., message size distribution, burstiness, amount of incast, application message ordering requirements and performance goals).
Transport protocols that are implemented in fixed-function hardware (e.g., RDMA network interface controllers can implement a RoCE protocol) can provide high performance but may not be able to be re-designed or modified after the fixed-function hardware has been taped out.
At least to provide a flexible and configurable transport protocol, a programmable event processing architecture with scheduling circuitry, packet buffering, and processors can perform at least congestion control and reliable packet delivery. The programmable event processing architecture with scheduling circuitry, packet buffering, and processors can support one or more of: packet reordering tolerance, selective retransmissions, window-based congestion control, and receiver-side congestion control. Cloud Service Providers (CSPs) can design and deploy custom datacenter transport protocols that are suited for their workloads and networks using the programmable event processing architecture. In addition, CSPs can use the platform to deploy custom data plane applications that monitor network health or host application performance, then provide useful metrics for control plane management.
A platform that provides programmability of transport protocols does not need to contain dedicated silicon for specific transport protocols. A transport protocol can be represented as a separate program and memory and compute resources can be flexibly allocated at compile time based on program requirements. CSPs can allocate a platform's resources to the set of programs to support. For example, resources need not be utilized for an Internet Wide Area RDMA Protocol (iWARP) protocol implementation if the CSP does not utilize iWARP in its network.
An upper protocol engine can provide an interface to applications. In some examples, an RDMA protocol engine can implement the InfiniBand Verbs interface and provide an interface to applications as well as the associated packetization such as splitting up a large message into maximum transmission unit (MTU) sized packets. A programmable event processing architecture with scheduling circuitry, packet buffering, and processors can then perform a configured and potentially custom reliable delivery and congestion control for packets generated by the upper protocol engine.
A programmable event processing architecture, described herein, such as PTA, can be configured to perform reliable packet delivery and congestion signal collection by analyzing packet header fields. A transport protocol's reactive congestion control algorithm (e.g., Swift, HPCC, etc.) can be implemented using programmed embedded cores. Collected congestion signals (and relevant connection state) can be sent to one or more embedded cores via in-memory mailbox queues. The cores can process congestion control events and return commands to update the connection state (e.g., congestion window (CWND) or transmission rate). A sender can adjust its transmit rate by adjusting a CWND size to adjust a number of sent packets for which acknowledgement of receipt was not received. Commands can be processed by programmable queue management to update the connection state and enforce the congestion control decisions. Programmable queue management can provide primitives to implement a wide range of queueing data structures including first in first out (FIFO) queues, go-back-N queues, or reorder queues.
The following provides example event processing nodes.
Conn CAM
Maintains global state: conn_cam, e.g., an exact match table that maps connection ID to connection cache index. This table can contain at most 8K entries (e.g., 8K connections fit in the cache/on-chip SRAM).
Consumes events:
Generates events:
Admission Check (and Eviction Selection)
Maintains global state:
Maintains the following connection cache state:
Miss Queue Management
PSN Assignment
Tx Queue Management
RUE State
Generate ACK
Maintains the following connection cache state:
Consumes the following events:
Generates the following events:
The event graph abstraction can be used to represent a transport protocol using fixed-function and user-defined nodes. An event graph implementation can define functionality of user-defined nodes and connectivity of an event graph. Edges can represent data-plane events. The following describe examples of events.
An example reliable transport (RT) protocol can be performed by use of PTA. A summary of example Initiator-side logic can be as follows:
A summary of example Target-side logic can be as follows:
Reference is made to 1204 for an RT loss recovery flow. In this example, PSN=2 is lost in the network. Upon receiving PSNs 3 and 4, the target PTA drops these packets because they fail the expected PSN check. Eventually, the retransmission timer expires and packets 2, 3, and 4 are retransmitted by PTA, without the involvement of the initiator ULP.
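The loss recovery flow above follows a go-back-N pattern, which can be sketched in Python as follows. The sketch assumes PSNs start at 0, that the target returns a cumulative ACK for each in-order packet, and that a single timeout triggers retransmission of all unacknowledged packets. The class names and the run_example helper are hypothetical.

```python
class Target:
    def __init__(self):
        self.expected_psn = 0
        self.delivered = []

    def receive(self, psn):
        """Accept only the expected PSN; return an ACK PSN or None if dropped."""
        if psn != self.expected_psn:
            return None                  # fails the expected PSN check
        self.delivered.append(psn)
        self.expected_psn += 1
        return psn                       # cumulative ACK up to this PSN


class Initiator:
    def __init__(self):
        self.next_psn = 0
        self.oldest_unacked = 0

    def send(self, count):
        psns = list(range(self.next_psn, self.next_psn + count))
        self.next_psn += count
        return psns

    def on_ack(self, psn):
        self.oldest_unacked = max(self.oldest_unacked, psn + 1)

    def on_timeout(self):
        """Retransmit every packet that has not been acknowledged."""
        return list(range(self.oldest_unacked, self.next_psn))


def run_example(lost_psn=2, count=5):
    ini, tgt = Initiator(), Target()
    for psn in ini.send(count):
        if psn == lost_psn:
            continue                     # packet lost in the network
        ack = tgt.receive(psn)
        if ack is not None:
            ini.on_ack(ack)
    retransmitted = ini.on_timeout()     # retransmission timer expires
    for psn in retransmitted:
        ack = tgt.receive(psn)
        if ack is not None:
            ini.on_ack(ack)
    return retransmitted, tgt.delivered
```

With the default arguments, run_example models PSN 2 being lost out of PSNs 0 through 4. The retransmission is driven entirely by the initiator-side timer state, which mirrors the point that the initiator ULP is not involved.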
At 1304, the programmable event processing architecture can perform operations based on the event graph description. For example, the plurality of programmable event processors can perform memory accesses separate from compute operations. For example, the plurality of programmable event processors can group events into at least one group. For example, the plurality of programmable event processors are to enforce atomic processing of events within a group of the at least one group. In some examples, the atomic processing includes propagation of state changes among events of the group. In some examples, the plurality of programmable event processors are to perform parallel processing of events belonging to different groups.
Network interface 1400 can include transceiver 1402, processors 1404, transmit queue 1406, receive queue 1408, memory 1410, bus interface 1412, and DMA engine 1452. Transceiver 1402 can be capable of receiving and transmitting packets in conformance with the applicable protocols such as Ethernet as described in IEEE 802.3, although other protocols may be used. Transceiver 1402 can receive and transmit packets from and to a network via a network medium (not depicted). Transceiver 1402 can include PHY circuitry 1414 and media access control (MAC) circuitry 1416. PHY circuitry 1414 can include encoding and decoding circuitry (not shown) to encode and decode data packets according to applicable physical layer specifications or standards. MAC circuitry 1416 can be configured to perform MAC address filtering on received packets, process MAC headers of received packets by verifying data integrity, remove preambles and padding, and provide packet content for processing by higher layers. MAC circuitry 1416 can be configured to assemble data to be transmitted into packets that include destination and source addresses along with network control information and error detection hash values.
Processors 1404 can be one or more of, or a combination of: a processor, core, graphics processing unit (GPU), field programmable gate array (FPGA), application specific integrated circuit (ASIC), or other programmable hardware device that allows programming of network interface 1400. For example, a “smart network interface” or SmartNIC can provide packet processing capabilities in the network interface using processors 1404.
Processors 1404 can include a programmable processing pipeline that is programmable by a packet processing program. A programmable processing pipeline can include configurable processing units based on a compiled program, as described herein. Processors, FPGAs, other specialized processors, controllers, devices, and/or circuits can be used for packet processing or packet modification. Ternary content-addressable memory (TCAM) can be used for parallel match-action or look-up operations on packet header content. Processors 1404 and/or FPGAs 1440 can include configurable processing units based on a compiled program.
Packet allocator 1424 can provide distribution of received packets for processing by multiple CPUs or cores using receive side scaling (RSS). When packet allocator 1424 uses RSS, packet allocator 1424 can calculate a hash or make another determination based on contents of a received packet to determine which CPU or core is to process a packet.
Interrupt coalesce 1422 can perform interrupt moderation whereby network interface interrupt coalesce 1422 waits for multiple packets to arrive, or for a time-out to expire, before generating an interrupt to host system to process received packet(s). Receive Segment Coalescing (RSC) can be performed by network interface 1400 whereby portions of incoming packets are combined into segments of a packet. Network interface 1400 provides this coalesced packet to an application.
Direct memory access (DMA) engine 1452 can copy a packet header, packet payload, and/or descriptor directly from host memory to the network interface or vice versa, instead of copying the packet to an intermediate buffer at the host and then using another copy operation from the intermediate buffer to the destination buffer.
Memory 1410 can be any type of volatile or non-volatile memory device and can store any queue or instructions used to program network interface 1400. Transmit traffic manager can schedule transmission of packets from transmit queue 1406. Transmit queue 1406 can include data or references to data for transmission by the network interface. Receive queue 1408 can include data or references to data that was received by the network interface from a network. Descriptor queues 1420 can include descriptors that reference data or packets in transmit queue 1406 or receive queue 1408. Bus interface 1412 can provide an interface with a host device (not depicted). For example, bus interface 1412 can be compatible with or based at least in part on PCI, PCIe, PCI-x, Serial ATA, and/or USB (although other interconnection standards may be used), or proprietary variations thereof.
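The relationship between descriptors and the queued data they reference can be sketched as follows. The field names, the bounded queue, and the helper `transmit_next` are hypothetical and chosen for illustration; they are not the layout of descriptor queues 1420.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class Descriptor:
    buffer_id: int  # hypothetical handle of a packet buffer
    length: int     # number of valid bytes in that buffer


class DescriptorQueue:
    """Bounded FIFO of descriptors that reference packet buffers."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._ring: Deque[Descriptor] = deque()

    def post(self, desc: Descriptor) -> bool:
        """Add a descriptor; return False when the queue is full."""
        if len(self._ring) >= self.capacity:
            return False
        self._ring.append(desc)
        return True

    def consume(self) -> Optional[Descriptor]:
        """Remove and return the oldest descriptor, or None if empty."""
        return self._ring.popleft() if self._ring else None


def transmit_next(queue: DescriptorQueue, buffers: Dict[int, bytes]) -> Optional[bytes]:
    """Resolve the next descriptor to the bytes it references."""
    desc = queue.consume()
    if desc is None:
        return None
    return buffers[desc.buffer_id][: desc.length]


# The buffers hold packet data; the descriptors only point at it.
buffers = {0: b"hello", 1: b"world!!"}
```

Keeping descriptors separate from packet data lets producer and consumer exchange small fixed-size records while the payload stays in place.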
In one example, system 1500 includes interface 1512 coupled to processor 1510, which can represent a higher speed interface or a high throughput interface for system components that need higher bandwidth connections, such as memory subsystem 1520, graphics interface components 1540, or accelerators 1542. Interface 1512 represents an interface circuit, which can be a standalone component or integrated onto a processor die.
Accelerators 1542 can be a fixed function or programmable offload engine that can be accessed or used by processor 1510. For example, an accelerator among accelerators 1542 can provide compression (DC) capability, cryptography services such as public key encryption (PKE), cipher, hash/authentication capabilities, decryption, or other capabilities or services. In some cases, accelerators 1542 can be integrated into a CPU socket (e.g., a connector to a motherboard or circuit board that includes a CPU and provides an electrical interface with the CPU). For example, accelerators 1542 can include a single or multi-core processor, graphics processing unit, logical execution unit, single or multi-level cache, functional units usable to independently execute programs or threads, application specific integrated circuits (ASICs), neural network processors (NNPs), programmable control logic, and programmable processing elements such as field programmable gate arrays (FPGAs) or programmable logic devices (PLDs). Accelerators 1542 can provide multiple neural networks, CPUs, processor cores, general purpose graphics processing units, or graphics processing units that can be made available for use by artificial intelligence (AI) or machine learning (ML) models. For example, the AI model can use or include one or more of: a reinforcement learning scheme, Q-learning scheme, deep-Q learning, Asynchronous Advantage Actor-Critic (A3C), combinatorial neural network, recurrent combinatorial neural network, or other AI or ML model.
Memory subsystem 1520 represents the main memory of system 1500 and provides storage for code to be executed by processor 1510, or data values to be used in executing a routine. Memory subsystem 1520 can include one or more memory devices 1530 such as read-only memory (ROM), flash memory, one or more varieties of random access memory (RAM) such as DRAM, or other memory devices, or a combination of such devices. Memory 1530 stores and hosts, among other things, operating system (OS) 1532 to provide a software platform for execution of instructions in system 1500. Additionally, applications 1534 can execute on the software platform of OS 1532 from memory 1530. Applications 1534 represent programs that have their own operational logic to perform execution of one or more functions. Processes 1536 represent agents or routines that provide auxiliary functions to OS 1532 or one or more applications 1534 or a combination. OS 1532, applications 1534, and processes 1536 provide software logic to provide functions for system 1500. In one example, memory subsystem 1520 includes memory controller 1522, which is a memory controller to generate and issue commands to memory 1530. It will be understood that memory controller 1522 could be a physical part of processor 1510 or a physical part of interface 1512. For example, memory controller 1522 can be an integrated memory controller, integrated onto a circuit with processor 1510.
While not specifically illustrated, it will be understood that system 1500 can include one or more buses or bus systems between devices, such as a memory bus, a graphics bus, interface buses, or others. Buses or other signal lines can communicatively or electrically couple components together, or both communicatively and electrically couple the components. Buses can include physical communication lines, point-to-point connections, bridges, adapters, controllers, or other circuitry or a combination. Buses can include, for example, one or more of a system bus, a Peripheral Component Interconnect (PCI) bus, a Hyper Transport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus (Firewire).
In one example, system 1500 includes interface 1514, which can be coupled to interface 1512. In one example, interface 1514 represents an interface circuit, which can include standalone components and integrated circuitry. In one example, multiple user interface components or peripheral components, or both, couple to interface 1514. Network interface 1550 provides system 1500 the ability to communicate with remote devices (e.g., servers or other computing devices) over one or more networks. Network interface 1550 can include an Ethernet adapter, wireless interconnection components, cellular network interconnection components, USB (universal serial bus), or other wired or wireless standards-based or proprietary interfaces. Network interface 1550 can transmit data to a device that is in the same data center or rack or a remote device, which can include sending data stored in memory.
Network interface 1550 can include one or more of: a network interface controller (NIC), a remote direct memory access (RDMA)-enabled NIC, SmartNIC, router, switch, or network-attached appliance. Some examples of network interface 1550 are part of an Infrastructure Processing Unit (IPU) or data processing unit (DPU) or utilized by an IPU or DPU. An XPU or xPU can refer at least to an IPU, DPU, GPU, GPGPU, or other processing units (e.g., accelerator devices). An IPU or DPU can include a network interface with one or more programmable pipelines or fixed function processors to perform offload of operations that could have been performed by a CPU. A programmable pipeline can be programmed using a packet processing pipeline program.
In one example, system 1500 includes one or more input/output (I/O) interface(s) 1560. I/O interface 1560 can include one or more interface components through which a user interacts with system 1500 (e.g., audio, alphanumeric, tactile/touch, or other interfacing). Peripheral interface 1570 can include any hardware interface not specifically mentioned above. Peripherals refer generally to devices that connect dependently to system 1500. A dependent connection is one where system 1500 provides the software platform or hardware platform or both on which operation executes, and with which a user interacts.
In one example, system 1500 includes storage subsystem 1580 to store data in a nonvolatile manner. In certain system implementations, at least certain components of storage 1580 can overlap with components of memory subsystem 1520. Storage subsystem 1580 includes storage device(s) 1584, which can be or include any conventional medium for storing large amounts of data in a nonvolatile manner, such as one or more magnetic, solid state, or optical based disks, or a combination. Storage 1584 holds code or instructions and data 1586 in a persistent state (e.g., the value is retained despite interruption of power to system 1500). Storage 1584 can be generically considered to be a “memory,” although memory 1530 is typically the executing or operating memory to provide instructions to processor 1510. Whereas storage 1584 is nonvolatile, memory 1530 can include volatile memory (e.g., the value or state of the data is indeterminate if power is interrupted to system 1500). In one example, storage subsystem 1580 includes controller 1582 to interface with storage 1584. In one example, controller 1582 is a physical part of interface 1514 or processor 1510 or can include circuits or logic in both processor 1510 and interface 1514.
A volatile memory is memory whose state (and therefore the data stored in it) is indeterminate if power is interrupted to the device. An example of a volatile memory includes a cache. A non-volatile memory (NVM) device is a memory whose state is determinate even if power is interrupted to the device.
In an example, system 1500 can be implemented using interconnected compute sleds of processors, memories, storages, network interfaces, and other components. High speed interconnects or device interfaces can be used such as: Ethernet (IEEE 802.3), remote direct memory access (RDMA), InfiniBand, Internet Wide Area RDMA Protocol (iWARP), Transmission Control Protocol (TCP), User Datagram Protocol (UDP), quick UDP Internet Connections (QUIC), RDMA over Converged Ethernet (RoCE), Peripheral Component Interconnect express (PCIe), Intel QuickPath Interconnect (QPI), Intel Ultra Path Interconnect (UPI), Intel On-Chip System Fabric (IOSF), Omni-Path, Compute Express Link (CXL), Universal Chiplet Interconnect Express (UCIe), HyperTransport, high-speed fabric, NVLink, Advanced Microcontroller Bus Architecture (AMBA) interconnect, OpenCAPI, Gen-Z, Infinity Fabric (IF), Cache Coherent Interconnect for Accelerators (CCIX), 3GPP Long Term Evolution (LTE) (4G), 3GPP 5G, and variations thereof. Data can be copied or stored to virtualized storage nodes or accessed using a protocol such as NVMe over Fabrics (NVMe-oF) or NVMe (e.g., Non-Volatile Memory Express (NVMe) Specification, revision 1.3c, published on May 24, 2018 or earlier or later versions, or revisions thereof).
Communications between devices can take place using a network that provides die-to-die communications; chip-to-chip communications; circuit board-to-circuit board communications; and/or package-to-package communications. Die-to-die communications can be consistent with Embedded Multi-Die Interconnect Bridge (EMIB), interposer, or other interfaces (e.g., Universal Chiplet Interconnect Express (UCIe), described at least in UCIe 1.0 Specification (2022), as well as earlier versions, later versions, and variations thereof).
Embodiments herein may be implemented in various types of computing devices, smart phones, tablets, personal computers, and networking equipment, such as switches, routers, racks, and blade servers such as those employed in a data center and/or server farm environment. The servers used in data centers and server farms comprise arrayed server configurations such as rack-based servers or blade servers. These servers are interconnected in communication via various network provisions, such as partitioning sets of servers into Local Area Networks (LANs) with appropriate switching and routing facilities between the LANs to form a private Intranet. For example, cloud hosting facilities may typically employ large data centers with a multitude of servers. A blade comprises a separate computing platform that is configured to perform server-type functions, that is, a “server on a card.” Accordingly, each blade includes components common to conventional servers, including a main printed circuit board (main board) providing internal wiring (e.g., buses) for coupling appropriate integrated circuits (ICs) and other components mounted to the board.
In some examples, network interface and other embodiments described herein can be used in connection with a base station (e.g., 3G, 4G, 5G and so forth), macro base station (e.g., 5G networks), picostation (e.g., an IEEE 802.11 compatible access point), nanostation (e.g., for Point-to-MultiPoint (PtMP) applications), micro data center, on-premise data centers, off-premise data centers, edge network elements, fog network elements, and/or hybrid data centers (e.g., data centers that use virtualization, content delivery network (CDN), cloud and software-defined networking to deliver application workloads across physical data centers and distributed multi-cloud environments). Systems and components described herein can be made available for use by a cloud service provider (CSP), or communication service provider (CoSP).
Various examples may be implemented using hardware elements, software elements, or a combination of both. In some examples, hardware elements may include devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, ASICs, PLDs, DSPs, FPGAs, memory units, logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth. In some examples, software elements may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, APIs, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an example is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation. A processor can be one or more of, or a combination of, a hardware state machine, digital control logic, central processing unit, or any hardware, firmware and/or software elements.
Some examples may be implemented using or as an article of manufacture or at least one computer-readable medium. A computer-readable medium may include a non-transitory storage medium to store logic. In some examples, the non-transitory storage medium may include one or more types of computer-readable storage media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. In some examples, the logic may include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, API, instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof.
According to some examples, a computer-readable medium may include a non-transitory storage medium to store or maintain instructions that when executed by a machine, computing device or system, cause the machine, computing device or system to perform methods and/or operations in accordance with the described examples. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and the like. The instructions may be implemented according to a predefined computer language, manner, or syntax, for instructing a machine, computing device or system to perform a certain function. The instructions may be implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
One or more aspects of at least one example may be implemented by representative instructions stored on at least one machine-readable medium which represents various logic within the processor, which when read by a machine, computing device or system causes the machine, computing device or system to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
The appearances of the phrase “one example” or “an example” are not necessarily all referring to the same example or embodiment. Any aspect described herein can be combined with any other aspect or similar aspect described herein, regardless of whether the aspects are described with respect to the same figure or element. Division, omission, or inclusion of block functions depicted in the accompanying figures does not imply that the hardware components, circuits, software and/or elements for implementing these functions would necessarily be divided, omitted, or included in embodiments.
Some examples may be described using the expression “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, descriptions using the terms “connected” and/or “coupled” may indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
The terms “first,” “second,” and the like, herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced items. The term “asserted” used herein with reference to a signal denotes a state of the signal, in which the signal is active, and which can be achieved by applying any logic level, either logic 0 or logic 1, to the signal. The terms “follow” or “after” can refer to immediately following or following after some other event or events. Other sequences of steps may also be performed according to alternative embodiments. Furthermore, additional steps may be added or removed depending on the particular applications. Any combination of changes can be used and one of ordinary skill in the art with the benefit of this disclosure would understand the many variations, modifications, and alternative embodiments thereof.
Disjunctive language such as the phrase “at least one of X, Y, or Z,” unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. Additionally, conjunctive language such as the phrase “at least one of X, Y, and Z,” unless specifically stated otherwise, should also be understood to mean X, Y, Z, or any combination thereof, including “X, Y, and/or Z.”
Illustrative examples of the devices, systems, and methods disclosed herein are provided below. An embodiment of the devices, systems, and methods may include any one or more, and any combination of, the examples described below.
Example 1 includes one or more examples, and includes an apparatus comprising: a network interface device comprising: a programmable event processing architecture comprising a plurality of programmable event processors, wherein: the plurality of programmable event processors are to perform memory accesses separate from compute operations, the plurality of programmable event processors are to group one or more events into at least one group, the plurality of programmable event processors are to perform parallel processing of events belonging to different groups, and the plurality of programmable event processors are programmed to perform at least one transport protocol.
Example 2 includes one or more examples, and includes a process interface to transmit and receive congestion control events and control plane events, wherein a processor is to perform congestion control and/or control plane software.
Example 3 includes one or more examples, and includes at least one circuitry to store packets accessed by at least one of the plurality of programmable event processors to parse and/or edit the packets.
Example 4 includes one or more examples, and includes at least one memory to store protocol state for access by at least one of the plurality of programmable event processors to implement a transport protocol.
Example 5 includes one or more examples, and includes an arithmetic logic unit (ALU) core resource pool to update event data and protocol state.
Example 6 includes one or more examples, and includes a scheduler to arbitrate among connection queues to select packets to forward into the network and/or to an application.
Example 7 includes one or more examples, and includes a scheduler to schedule events to be processed by the plurality of programmable event processors based on time.
Example 8 includes one or more examples, and includes a memory interface to fetch protocol state to on-chip memory and evict protocol state from the on-chip memory.
Example 9 includes one or more examples, and includes circuitry to route events between the plurality of programmable event processors to allow bypass of a programmable event processor and recirculation of events.
Example 10 includes one or more examples, wherein the programmable event processing architecture comprises one or more of: linear pipeline, linear pipeline with event recirculation support, linear pipeline with bypass support for a programmable event processor, and/or linear pipeline with forward and/or reverse arbitrated buses.
Example 11 includes one or more examples, and includes an event switch to route scheduling related events between the plurality of programmable event processors and at least one scheduler.
Example 12 includes one or more examples, wherein the plurality of programmable event processors are programmed to perform at least one transport protocol based on an event graph description with defined nodes.
Example 13 includes one or more examples, and includes at least one non-transitory computer-readable medium, comprising instructions stored thereon, that if executed by one or more processors, cause the one or more processors to: configure a plurality of programmable event processors of a network interface device programmed to perform one or more transport layer protocols, wherein the plurality of programmable event processors are to perform memory accesses separate from compute operations, the plurality of programmable event processors are to group one or more events into at least one group, the plurality of programmable event processors are to enforce atomic processing of other events within a group of the at least one group, wherein the atomic processing comprises propagation of state changes among events of the group, and the plurality of programmable event processors are to perform parallel processing of events belonging to different groups.
Example 14 includes one or more examples, and includes instructions stored thereon, that if executed by one or more processors, cause the one or more processors to: configure a processor of the one or more processors to perform congestion control and/or control plane software based on congestion control events and control plane events from at least one of the plurality of programmable event processors.
Example 15 includes one or more examples, and includes instructions stored thereon, that if executed by one or more processors, cause the one or more processors to: configure at least one of the plurality of programmable event processors of a network interface device programmed using an event graph description with defined nodes to update event data and protocol state.
Example 16 includes one or more examples, and includes instructions stored thereon, that if executed by one or more processors, cause the one or more processors to: configure a scheduler to schedule events to be processed by the plurality of programmable event processors based on time.
Example 17 includes one or more examples, and includes a method that includes: in a datacenter: a plurality of programmable event processors of a network interface device programmed to perform one or more transport layer protocols, wherein the plurality of programmable event processors are to perform memory accesses separate from compute operations, the plurality of programmable event processors are to group one or more events into at least one group, the plurality of programmable event processors are to enforce atomic processing of other events within a group of the at least one group, wherein the atomic processing comprises propagation of state changes among events of the group, and the plurality of programmable event processors are to perform parallel processing of events belonging to different groups.
Example 18 includes one or more examples, and includes in the datacenter: a processor performing congestion control and/or control plane software based on congestion control events and control plane events from at least one of the plurality of programmable event processors.
Example 19 includes one or more examples, and includes in the datacenter: configuring at least one of the plurality of programmable event processors of a network interface device programmed using an event graph description with defined nodes to update event data and protocol state.
Example 20 includes one or more examples, and includes in the datacenter: a scheduler scheduling events to be processed by the plurality of programmable event processors based on time.
Example 21 includes one or more examples, wherein the transport layer protocols comprise two or more of: remote direct memory access (RDMA) over Converged Ethernet (RoCE), RoCEv2, scalable reliable datagram (SRD), Elastic Fabric Adapter (EFA), Distributed Universal Access (DUA) and Lightweight Transport Layer (LTL), High Precision Congestion Control (HPCC), improved RoCE NIC (IRN), Homa, NDP, and/or EQDS.
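The grouping behavior recited in several of the examples above (events grouped by a flow or group identifier, state changes propagated in order among events of one group, different groups processed in parallel, and memory accesses kept separate from compute) can be sketched in software. This is an illustrative sketch only: the event fields `conn` and `length`, the per-group state layout, and the use of a thread pool are assumptions, not the described hardware architecture.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Event = Dict[str, int]
State = Dict[str, int]


def group_id(event: Event) -> int:
    """Hypothetical grouping rule: a subset of the event metadata fields."""
    return event["conn"]


def compute(state: State, event: Event) -> State:
    """Pure compute step: touches no shared memory, returns updated state."""
    return {
        "bytes": state["bytes"] + event["length"],
        "events": state["events"] + 1,
    }


def process_group(initial: State, events: List[Event]) -> State:
    """Events of one group run serially, so each sees earlier state changes."""
    state = initial
    for ev in events:
        state = compute(state, ev)
    return state


def process_events(events: List[Event], state_table: Dict[int, State]) -> Dict[int, State]:
    groups: Dict[int, List[Event]] = defaultdict(list)
    for ev in events:
        groups[group_id(ev)].append(ev)

    with ThreadPoolExecutor() as pool:
        futures = {}
        for gid, evs in groups.items():
            # Memory read happens here, before and apart from compute.
            initial = state_table.get(gid, {"bytes": 0, "events": 0})
            futures[gid] = pool.submit(process_group, initial, evs)
        # Memory write-back happens after compute for each group completes.
        for gid, fut in futures.items():
            state_table[gid] = fut.result()
    return state_table
```

Because no state is shared between groups in this sketch, groups can proceed concurrently without locks, while ordering is only enforced inside a group.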
This application claims the benefit of and priority to U.S. Provisional Application No. 63/342,909, filed May 17, 2022, and U.S. Provisional Application No. 63/419,960, filed Oct. 27, 2022. The entire contents of those applications are incorporated by reference in their entirety.
Number | Date | Country
---|---|---
63419960 | Oct 2022 | US
63342909 | May 2022 | US