A network switch can include a central processing unit (CPU) coupled to a packet processor. The packet processor can be implemented as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA). The packet processor can be used to maintain a large number of count values and other state or status information that are periodically read out by software on the CPU to populate a central database or to perform other network operations. Using software on the CPU to actively read values out from the packet processor can consume a substantial amount of CPU time and interface bandwidth.
It can be challenging to design a network device whose software and hardware follow different development cycles. Because the address map of counter values and other state variables needs to be properly synchronized between the CPU and the packet processor, the software and hardware development cycles have to be kept in sync, placing added pressure on the teams designing the software on the CPU and the packet processing hardware. It is within this context that the embodiments herein arise.
A network device such as a router or switch may include a main processor (CPU) and an associated packet processor for processing data packets in accordance with a network protocol. The packet processor can be implemented as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) device. The packet processor can include a multi-interface bridge for connecting a group or tree of registers to a plurality of physical interfaces configured to support different communications protocols. The various physical interfaces can access the tree of registers using a universal address map that is shared among the plurality of interfaces. Use of a multi-interface bridge to support a plurality of physical interfaces enables the development of new hardware features while supporting new and old interfaces so that the new hardware is backwards compatible with existing software. The multi-interface bridge can thus help decouple the software development cycle from the hardware development cycle.
The packet processor can also include count aggregation circuitry configured to efficiently collect counter (count) values for each physical interface (also referred to as a port), to group the counter values of each port into a data packet, and to transmit or publish the data packet over the network. The count aggregation circuitry can include host interface counter circuitry and client interface counter circuitry. The client interface counter circuitry may be coupled to a plurality of ports. The client interface counter circuitry may include a plurality of partial sum circuits each of which is configured to obtain count values associated with a respective one of the ports, a plurality of sum select circuits configured to select a partial sum value to be output for accumulation, and a global accumulation circuit configured to receive selected partial sum values from the plurality of sum select circuits.
The count values stored in the global accumulation circuit can be duplicated to a memory in the host interface counter circuitry. The host interface counter circuitry can read count values associated with a particular port from the memory, package the count values into a data packet, and publish the data packet via a statistics publishing interface. Data packets generated in this way can be gathered and stored at a local server or a remote (central) database. Gathering count/statistics information in this way alleviates the need for software running on the main processor to actively read or poll all of the registers, counters, or other storage circuits from the packet processor.
Processor 12 may be used to run a network device operating system such as operating system (OS) 18 and/or other software/firmware that is stored on memory 14. Memory 14 may include non-transitory (tangible) computer readable storage media that stores operating system 18 and/or any software code, sometimes referred to as program instructions, software, data, instructions, or code. Memory 14 may include nonvolatile memory (e.g., flash memory or other electrically-programmable read-only memory configured to form a solid-state drive), volatile memory (e.g., static or dynamic random-access memory), hard disk drive storage, and/or other storage circuitry. The processing circuitry and storage circuitry described above are sometimes referred to collectively as control circuitry. Processor 12 and memory 14 are sometimes referred to as being part of a control plane of network device 10.
Operating system 18 in the control plane of network device 10 may exchange network topology information with other network devices using a routing protocol. Routing protocols are software mechanisms by which multiple network devices communicate and share information about the topology of the network and the capabilities of each network device. For example, network routing protocols may include Border Gateway Protocol (BGP) or other distance vector routing protocols, Enhanced Interior Gateway Routing Protocol (EIGRP), Exterior Gateway Protocol (EGP), Routing Information Protocol (RIP), Open Shortest Path First (OSPF) protocol, Label Distribution Protocol (LDP), Multiprotocol Label Switching (MPLS), Intermediate System to Intermediate System (IS-IS) protocol, or other Internet routing protocols (just to name a few).
Packet processor 16 is oftentimes referred to as being part of a data plane or forwarding plane. Packet processor 16 may represent processing circuitry based on one or more microprocessors, general-purpose processors, application specific integrated circuits (ASICs), programmable logic devices such as field-programmable gate arrays (FPGAs), a combination of these processors, or other types of processors. Packet processor 16 receives incoming data packets via an ingress port 15, analyzes the received data packets, processes the data packets in accordance with a network protocol, and forwards (or drops) the data packet accordingly.
A data packet is a formatted unit of data conveyed over the network. Data packets conveyed over a network are sometimes referred to as network packets. A group of data packets intended for the same destination should have the same forwarding treatment. A data packet typically includes control information and user data (payload). The control information in a data packet can include information about the packet itself (e.g., the length of the packet and packet identifier number) and address information such as a source address and a destination address. The source address represents an Internet Protocol (IP) address that uniquely identifies the source device in the network from which a particular data packet originated. The destination address represents an IP address that uniquely identifies the destination device in the network at which a particular data packet is intended to arrive.
Data packets received in the data plane may optionally be analyzed in the control plane to handle more complex signaling protocols. Packet processor 16 may generally be configured to partition data packets received at ingress port 15 into groups of packets based on their destination address and to choose a next hop device for each data packet when exiting an egress port 17. The choice of next hop device for each data packet may occur through a hashing process (as an example) over the packet header fields, the result of which is used to select from among a list of next hop devices in a routing table stored in memory in packet processor 16. Such a routing table listing the next hop devices for different data packets is sometimes referred to as a hardware forwarding table, a hardware forwarding information base (FIB), or a media access control (MAC) address table. The routing table may list actual next hop network devices that are currently programmed on network device 10 for each group of data packets having the same destination address. If desired, the routing table may also list actual next hop devices currently programmed for device 10 for multiple destination addresses (i.e., device 10 can store a single hardware forwarding table separately listing programmed next hop devices corresponding to different destination addresses). The example of
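The hash-based next-hop selection described above can be illustrated with a brief behavioral sketch (not the packet processor's actual hardware logic); the forwarding-table contents, the CRC-based hash, and the header fields used as the hash key are illustrative assumptions:

```python
import zlib

# Hypothetical hardware forwarding table (FIB): maps a destination
# address to the list of next hop devices currently programmed for it.
FIB = {
    "203.0.113.0": ["hop-a", "hop-b", "hop-c"],
    "198.51.100.0": ["hop-d"],
}

def select_next_hop(fib, dst, src, protocol=6):
    """Pick a next hop by hashing over packet header fields.

    Packets sharing the same header fields hash to the same index,
    so a group of packets bound for the same destination receives
    the same forwarding treatment.
    """
    hops = fib[dst]
    key = f"{src}|{dst}|{protocol}".encode()
    index = zlib.crc32(key) % len(hops)  # deterministic hash-based pick
    return hops[index]
```

Because the hash is deterministic, repeated lookups for the same flow always return the same next hop from the list.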
Packet processor 16 may include storage components for storing count values, state information, statistics information, or other data that might be used for performing routing or networking functions. For example, packet processor 16 may include register circuits such as registers 20 for storing data within the packet processor. The use of registers 20 is merely exemplary. In general, other types of data storage elements such as counters, memory, and block random-access memories (block RAMs) can be included within packet processor 16. In some embodiments, the data can also be stored on one or more external memory devices that are directly coupled to packet processor 16.
The main processor 12 may need to access the information stored on registers 20 and the other data storage components associated with packet processor 16. Processor 12 may communicate with packet processor 16 via a processor-to-processor physical interface 13. Interface 13 may represent one or more processor-to-processor physical interfaces for communicating with different processors using different types of computer bus protocols. For example, processor-to-processor interface(s) 13 can include one or more I2C (Inter-Integrated Circuit) computer interface/bus, one or more PCIe (Peripheral Component Interconnect Express) computer interface/bus, one or more Ethernet computer interface/bus, one or more UART (Universal Asynchronous Receiver-Transmitter) computer interface/bus, one or more SPI (Serial Peripheral Interface) computer interface/bus, one or more RapidIO computer interface/bus, one or more Interlaken computer interface/bus, one or more AGP (Accelerated Graphics Port) computer interface/bus, and/or other types of processor-to-processor interfaces.
Conventionally, a packet processor might include only one type of physical interface for communicating with a corresponding CPU on the same network device. In such scenarios, any hardware updates to the packet processor will also require a corresponding software update to the CPU, thus placing a tight coupling requirement between the software development cycle and the hardware development cycle.
In accordance with an embodiment, packet processor 16 may be provided with a multi-interface bridge component operable with different types of physical interfaces.
Network 22 may include storage components (elements) 20 coupled together using a plurality of transport layer (TL) nodes 24. As the name suggests, the “transport layer” nodes can represent or can be defined as connection nodes in layer 4 (L4) of the OSI (Open Systems Interconnection) model, which sits between the higher, application-oriented layers and the lower network layer. The transport layer nodes 24 can be connected to form a tree-like network. If desired, some of the transport layer nodes 24 can optionally be connected in a daisy chain, as shown by dotted path 30. Storage components 20 may be connected to the “leaf” (endpoint) nodes of the tree-like network. Storage components 20 may represent registers, counters, memory, block random-access memories (block RAMs), and/or other types of data storage elements for storing count values, state information, statistics information, or other information that might be used for performing routing or networking functions.
Network 22 configured in this way is therefore sometimes referred to as a transport layer routing and data storage network. The example of
Configured in this way, each interface 28 can separately access the data storage components 20 within the tree-like network 22 using a universal address map. When data storage components 20 are registers, the universal address map used by multi-interface bridge 26 to access the network (or tree) of registers 20 can be referred to as a register map.
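The shared universal address map can be sketched behaviorally as follows; the class names, register width, and address values are illustrative assumptions, but the key point holds: every physical interface resolves its accesses through one common map over one register tree:

```python
class RegisterTree:
    """Shared storage for the tree of registers, indexed by a
    universal address map common to all physical interfaces."""
    def __init__(self, address_map):
        self.address_map = address_map                 # name -> address
        self.registers = {a: 0 for a in address_map.values()}

    def write(self, address, value):
        # Illustrative 32-bit register width.
        self.registers[address] = value & 0xFFFFFFFF

    def read(self, address):
        return self.registers[address]


class BridgeInterface:
    """One physical interface attached to the multi-interface bridge.
    Every interface resolves register names through the same map, so
    new interfaces can be added without renumbering the registers."""
    def __init__(self, name, tree):
        self.name = name
        self.tree = tree

    def read_by_name(self, reg_name):
        return self.tree.read(self.tree.address_map[reg_name])
```

Two interfaces built over the same tree (say, a PCIe front end and an I2C front end) observe identical values for the same register name.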
The three modes of
As described above, packet processor 16 can include data storage elements (e.g., counters 20) for storing count values associated with one or more input-output ports of processor 16. Packet processor 16 can simultaneously communicate with multiple endpoint devices via a plurality of ingress and egress ports. The ingress and egress ports are therefore sometimes referred to collectively as client (endpoint) ports or client (endpoint) input-output (I/O) ports. As examples, packet processor 16 can include at least two input-output ports for communicating with up to two endpoint or client devices, more than two input-output ports for communicating with more than two endpoint or client devices, two to ten input-output ports for communicating with up to ten endpoint or client devices, 10-20 input-output ports for communicating with up to 20 endpoint or client devices, 20-50 endpoint input-output ports for communicating with up to 50 endpoint or client devices, or more than 50 endpoint input-output ports for communicating with more than 50 endpoint or client devices.
Packet processor 16 may include counters for keeping track of count values associated with each input-output port and aggregation circuitry for summing up the various count values. Packet processor 16 may also include statistics publishing circuitry for efficiently publishing the aggregated count values.
Host interface counter circuitry 40 can be used to interface with a host device or controller such as main processor (CPU) 12 and can receive a first clock signal such as host clock signal clk_host and a second clock signal such as client/core clock signal clk_core. Host interface counter circuitry 40 can include memory such as memory module 46 for receiving count values or statistical data aggregated at client interface counter circuitry 42. Host interface counter circuitry 40 can be coupled to one or more register interfaces 28 such as physical interfaces 28-1, 28-2, and 28-3 of the type described in connection with
Host interface counter circuitry 40 can also be coupled to a host CPU 12 via statistics publishing interface 44. Host interface counter circuitry 40 can collect all relevant counter values associated with any particular input-output port into a data structure and then transmit or publish the collected counter values and statistical information in the form of one or more data packet(s). For example, host interface counter circuitry 40 can transmit UDP (User Datagram Protocol) packets containing TLV (Type, Length, Value) data structures that hold all counter values for each input-output port. A TLV record can include a TLV number, a base port number, a port counter indicating the total number of supported ports, and the number of counters per record. The TLV number can be a 16-bit number (as an example). In other embodiments, the TLV number can have fewer than 16 bits or can have more than 16 bits. Each counter in the record can be a 32-bit counter (as an example). In other embodiments, individual counter values can be fewer than 32 bits or can be greater than 32 bits.
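A behavioral sketch of packing one such TLV record might look like the following; the 16-bit TLV number and 32-bit counters follow the examples above, while the remaining header field widths and their ordering are illustrative assumptions:

```python
import struct

def pack_tlv_record(tlv_number, base_port, port_count,
                    counters_per_record, counter_values):
    """Pack one TLV record as it might appear in a published UDP
    payload: a small header followed by 32-bit counter values.

    The header fields other than the 16-bit TLV number, and the
    network byte order, are illustrative assumptions.
    """
    header = struct.pack("!HHHH", tlv_number, base_port,
                         port_count, counters_per_record)
    body = struct.pack(f"!{len(counter_values)}I", *counter_values)
    return header + body
```

A record packed this way can be unpacked symmetrically by the receiving server using the same format strings.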
Client interface counter circuitry 42 can receive client/core clock signal clk_core and can be used to monitor counter values associated with each of the N input-output ports (see, e.g., ports P1, P2, . . . , and PN coupled to client interface counter circuitry 42). Client interface counter circuitry 42 can include a plurality of counters or other counting circuits for monitoring count values per direction for each of the N ports. For example, counter circuitry 42 can increment, on a per port basis, a transmit (TX) count in response to transmission-related toggling events, including but not limited to transmitting one or more unicast packets, one or more multicast packets, one or more broadcast packets, a transmit discard packets count, a transmit byte count, transmit error signal information, or transmit frame pausing information. Similarly, counter circuitry 42 can increment, on a per port basis, a receive (RX) count in response to reception-related toggling events, including but not limited to receiving one or more unicast packets, one or more multicast packets, one or more broadcast packets, a receive discard packets count, a receive byte count, receive error signal information, or receive frame pausing information. The various count values accumulated in client interface counter circuitry 42 can optionally be cleared by one or more clear counter signals clear_counters received from host interface counter circuitry 40.
The various count values accumulated in client interface counter circuitry 42 can be written to memory module 46 on host interface counter circuitry 40. In some embodiments, accumulated count values stored within client interface counter circuitry 42 can be duplicated or mirrored onto memory module 46, and the duplicated count values can then be published (e.g., in the form of one or more UDP packets) to a corresponding host processor and/or sent to a central server or database. The aggregated counter or state information can be sent to a remote server (e.g., a UDP server) via one or more Ethernet interfaces, one or more Interlaken interfaces, or other types of network interfaces. Aggregating and publishing the counter values and other internal state information for the input-output ports as fully formed network packets in this way alleviates the need for software running on the CPU to actively poll or read all the counter/register values while allowing the collection of such counter data to be offloaded to some other server or cloud service, thus helping to remove compute overhead from the local network device.
Client interface counter circuitry 42 may include hardware configured to monitor the counter values associated with each input-output port.
The partial sum circuits may include at least a first partial sum circuit 50-1 configured to keep track of transmit and/or receive toggle events associated with a first input-output port P1, a second partial sum circuit 50-2 configured to keep track of transmit and/or receive toggle events associated with a second input-output port P2, and a third partial sum circuit 50-3 configured to keep track of transmit and/or receive toggle events associated with a third input-output port P3. In general, client interface counter circuitry 42 can include any number of partial sum circuits 50 configured to monitor TX and/or RX toggle events associated with any number of input-output ports at the packet processor. Each partial sum circuit 50 can include a group (set) of counters 58 each keeping track of a partial sum. Counters 58 can sometimes be referred to as partial sum counters. Counters 58 can each be a relatively small counter circuit such as a 5-bit counter, a 6-bit counter, a 4-6 bit counter, or a 3-7 bit counter. The group of counters 58 can be coupled to a shared adder logic 60 to help minimize the total amount of resources required to implement the interface counters 58. Each partial sum circuit 50 can include a total of forty to sixty counters 58, thirty to seventy counters 58, at least 20 counters 58, at least 30 counters 58, more than fifty counters 58, more than sixty counters 58, sixty to a hundred counters 58, or more than a hundred counters 58.
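The partial sum idea can be sketched behaviorally as follows; the 6-bit width and 48-counter bank size fall within the ranges given above, and the read-and-clear step stands in for the clear signals generated during draining:

```python
class PartialSumCircuit:
    """Sketch of one partial sum circuit 50: a bank of small
    counters 58 (6-bit here), one per tracked event, updated through
    a shared adder. Widths and bank size are illustrative."""
    WIDTH = 6                        # bits per partial sum counter

    def __init__(self, num_counters=48):
        self.counters = [0] * num_counters

    def bump(self, index, amount=1):
        # Shared adder: the small counters wrap at 2**WIDTH, which is
        # safe so long as each counter is drained before it can wrap.
        self.counters[index] = (self.counters[index] + amount) % (1 << self.WIDTH)

    def read_and_clear(self, index):
        """Output one partial sum and clear it, as happens when the
        corresponding sum select circuit drains this bank."""
        value = self.counters[index]
        self.counters[index] = 0
        return value
```

Keeping the per-event counters only a few bits wide works because the global accumulator drains them frequently; the small width is what minimizes the resources spent on the interface counters.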
The partial sums from the various partial sum counters 58 can, as an example, be output as an array of SLVs (standard logic vectors) containing all the partial sums to a corresponding sum select circuit 52. In the example of
The sum select circuit 52 can be configured to select a partial sum value to be output for accumulation to global accumulator circuit 56. The sum select circuits 52 can be connected together in a chain. As shown in the example of
Connected in this way, the enable pulse SS_en from global accumulator 56 can trigger a chain reaction that enables the output and clearing of each partial sum in a sequential fashion. The first partial sum can be selected for output and cleared two cycles after the enable pulse arrives at a given sum select circuit. This allows the sum select circuit at least one pipeline stage to generate the corresponding clear signal for the partial sum circuit and also one cycle for the sum select circuit to perform the input multiplexing across the array of partial sums. The next partial sum can be output and cleared in a subsequent clock cycle. The done signal can be pulsed two cycles before the last partial sum is output and cleared from the sum select circuit, which allows the done signal of one sum select circuit to be fed into the enable input of the next sum select circuit to trigger a similar operation in the next sum select circuit in the chain. When a sum select circuit is not enabled, the output bus is zeroed so that the outputs of all sum select circuits can be combined using a logic OR operation (see logic OR component 54) without needing a multiplexer with an explicit select signal. Alternatively, a multiplexing circuit can be used in place of logic OR module 54 to select from among the outputs of the N sum select circuits 52.
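Abstracting away the exact pipeline timing (the two-cycle offsets), the chain's essential behavior can be sketched as follows: on any given cycle exactly one enabled sum select circuit drives its current partial sum while every disabled circuit drives zero, so a logic OR across all outputs recovers the active value:

```python
def drain_chain(partial_sum_banks):
    """Behavioral sketch of the sum select chain: an enable pulse
    ripples from one sum select circuit to the next; each circuit,
    while enabled, outputs and clears one partial sum per cycle, and
    every disabled circuit drives zero, so a bitwise OR of all
    outputs recovers the selected partial sum without an explicit
    multiplexer select signal."""
    combined = []
    for active, bank in enumerate(partial_sum_banks):
        for i in range(len(bank)):
            # Outputs of all circuits this cycle: zeros everywhere
            # except the currently enabled circuit.
            outputs = [0] * len(partial_sum_banks)
            outputs[active] = bank[i]
            bank[i] = 0                    # clear after output
            ored = 0
            for value in outputs:          # logic-OR combining
                ored |= value
            combined.append(ored)
    return combined
```

After the chain has run, every partial sum bank has been cleared and the combined stream contains each partial sum exactly once, in chain order.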
Global accumulator circuit 56 can be configured to receive and accumulate (aggregate) the partial sum values output from the various sum select circuits 52. As described above, global accumulator circuit 56 can generate an initial enable signal (pulse) to trigger a chain reaction through the multiple sum select circuits 52, where the output of all of the sum select circuits can be OR'ed together to provide a single partial sum for the currently selected (currently enabled) partial sum. The combined OR'ed partial sum value can be accumulated by global accumulator 56 and stored in a memory circuit such as block RAM (BRAM) 57 in global accumulator 56. Global accumulator 56 can synchronize the read and write addresses to RAM 57 in such a way that an output of the sum select circuits and a value read from RAM 57 arrive at appropriate times to accumulate a counter value and then write the accumulated value back into the same location from which it was read (e.g., a value is read from the block RAM, added to the currently received partial sum value, and the corresponding sum is then written back into the block RAM).
Counter values accumulated in RAM 57 are allowed to roll over. To enable roll over, RAM 57 can be accessed using a read pointer and a separate write pointer. The read access can be performed ahead of the write access (as an example). The use of both read and write ports of RAM 57 can help maximize the rate at which the counters within circuitry 42 are updated. As an example, RAM 57 can be configured to implement 512 counters or entries (e.g., effectively representing 512 32-bit counters). The 512 counters can be refreshed or accumulated once every 512 clock cycles, which can help minimize the size of the counters 58 within each partial sum circuit, which are relatively smaller counters. This example is merely illustrative. In other embodiments, RAM 57 may implement 128 multi-bit counters that are refreshed once every 128 cycles, 256 multi-bit counters that are refreshed once every 256 cycles, 1024 multi-bit counters that are refreshed once every 1024 cycles, fewer than 512 counters/entries, more than 512 counters/entries, or any desired number of multi-bit counters or registers. Global accumulator 56 may include one or more BRAMs 57 each having the same or different number of counters.
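The read-modify-write accumulation with rollover can be modeled as follows; the 512-entry depth and 32-bit width follow the example above, while the class shape is an illustrative assumption:

```python
class GlobalAccumulator:
    """Sketch of global accumulator 56: a RAM of 32-bit counters
    updated read-modify-write, with values allowed to roll over
    (wrap modulo 2**32). Depth 512 matches the example in the text."""
    MASK = 0xFFFFFFFF                 # 32-bit rollover

    def __init__(self, depth=512):
        self.ram = [0] * depth

    def accumulate(self, address, partial_sum):
        # Read the current value, add the incoming partial sum, and
        # write the result back to the same location it was read from.
        self.ram[address] = (self.ram[address] + partial_sum) & self.MASK
        return self.ram[address]
```

Software reading the published counters is expected to handle the wrap (e.g., by differencing successive samples modulo 2**32), which is what permits the hardware counters to simply roll over.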
Global accumulator circuit 56 can output a write value and a corresponding write address to host interface counter circuitry 40 in order to mirror (copy) the contents of RAM 57 into memory 46 within circuitry 40 (see
During the operations of block 72, the sum select circuits can sequentially select a partial sum value to be output for aggregation at the global accumulation circuit and can generate a partial sum clear signal back to the partial sum circuits. The sum select circuits can be connected in a chain. The first sum select circuit in the chain can receive an enable pulse from the global accumulation circuit. A partial sum can be selected for output and cleared two cycles after the enable pulse arrives at a given sum select circuit. A next partial sum can be output and cleared in a subsequent clock cycle. The done signal can be pulsed two cycles before the last partial sum is output and cleared from the sum select circuit. When a sum select circuit is not enabled, the output bus is zeroed so that the outputs of all sum select circuits can be combined using a logic OR operation.
During the operations of block 74, the partial sum values output from the sum select circuits can be accumulated at the global accumulation circuit. Although block 74 is shown as occurring after blocks 70 and 72, the operations of block 74 can sometimes occur in parallel or simultaneously with the operations of block 70 and 72. The global accumulator circuit can include an internal memory module such as block RAM 57 for implementing a relatively large counter. As examples, block RAM 57 can include at least 128 entries, 256 entries, 512 entries, 1024 entries, or more than 500 entries. The global accumulator circuit can synchronize the read and write addresses to block RAM 57 in such a way that an output of the sum select circuits and a value read from RAM 57 arrive at appropriate times to accumulate a counter value and then write the accumulated value back into the same location from which it was read.
During the operations of block 76, the accumulated count values stored within RAM 57 can be duplicated or mirrored to memory 46 within the host interface counter circuitry 40. This can allow the count values to be read out asynchronously from the host interface counter circuitry 40 at a different clock rate using host clock signal clk_host that is separate from the internal clock clk_core controlling client interface counter circuitry 42.
During the operations of block 78, the host interface counter circuitry 40 can read (e.g., from mirrored memory 46) counter, status, or other accumulated statistical information for any particular input-output port, package the count values into a data packet (e.g., a UDP data packet or other type of datagram), and transmit the data packet via the statistics publishing interface 44 (see
The foregoing embodiments may be made part of a larger system.
As an example, network device 100 can be part of a host device that is coupled to one or more output devices 102 and/or to one or more input devices 104. Input device(s) 104 may include one or more touchscreens, keyboards, mice, microphones, touchpads, electronic pens, joysticks, buttons, sensors, or any other type of input devices. Output device(s) 102 may include one or more displays, printers, speakers, status indicators, external storage, or any other type of output devices.
System 120 may be part of a digital system or a hybrid system that includes both digital and analog subsystems. System 120 may be used in a wide variety of applications as part of a larger computing system, which may include but is not limited to: a datacenter, a computer networking system, a data networking system, a digital signal processing system, a graphics processing system, a video processing system, a computer vision processing system, a cellular base station, a virtual reality or augmented reality system, a network functions virtualization platform, an artificial neural network, an autonomous driving system, a combination of at least some of these systems, and/or other suitable types of computing systems.
The methods and operations described above in connection with
The foregoing is merely illustrative and various modifications can be made to the described embodiments. The foregoing embodiments may be implemented individually or in any combination.