The present invention relates to techniques for modeling and synthesizing circuits for packet processing and, more particularly, to methods and apparatus for modeling and synthesizing circuits for packet processing using a packet editing graph.
Packet switches, routers or other packet forwarding elements are a basic building block in any data communication network. The primary role of a packet switch is to forward packets from one port to another port based on the contents of each packet, specifically header data at the beginning of each packet. As part of this forwarding operation, packets are classified, queued, modified, transmitted, or dropped.
The forwarding algorithms used in most switches are relatively simple by design to facilitate efficient hardware implementations. However, performance considerations make the forwarding algorithms tedious to code in a standard Register Transfer Level (RTL) flow. In particular, hardware implementations of forwarding algorithms are typically deeply pipelined circuits that operate on wide buses (e.g., 128 or 256 bits) and interact with high-speed first-in-first-out (FIFO) buffers through a rigid handshaking protocol. Thus, their control finite-state machines are complicated and difficult to write correctly.
A need therefore exists for improved techniques for synthesizing such packet processing circuits.
Generally, methods and apparatus are provided for modeling and synthesizing circuits for packet processing that transform one or more fields of a packet. According to one aspect of the invention, a circuit for packet processing that transforms one or more fields of a packet is modeled by representing the transformation using a packet editing graph having at least one node. The transformation can comprise one or more of adding, removing, modifying and maintaining the at least one field of a packet header. The packet editing graph can have at least one conditional node which has a plurality of output branches, wherein a value of at least one of the fields is determined by selecting a corresponding one of the output branches based on a value of a predicate applied to the conditional node. The packet editing graph can also include one or more of arithmetic and logical operators and connections among one or more of inputs, operators and outputs.
According to another aspect of the invention, a circuit for packet processing that transforms one or more fields of a packet is synthesized by synthesizing a control finite state machine based on a packet editing graph having at least one node, wherein the packet editing graph represents the circuit for packet processing. Nodes in the packet editing graph are transformed into registers. Conditional nodes in the packet editing graph are transformed into a multiplexer controlled by the control finite state machine. Arithmetic and logical operators in the packet editing graph are transformed into one or more combinational circuits. A wrapper function is also synthesized that surrounds the synthesized core, wherein the wrapper function identifies packet boundaries using one or more signal flags.
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.
The present invention provides a high-level synthesis technique for packet editing blocks. Although the present invention is illustrated herein in the context of Field-Programmable Gate Arrays (FPGAs), the techniques of the present invention could also be adapted for use with application specific integrated circuits (ASICs), as would be apparent to a person of ordinary skill in the art.
As used herein, the term “line card” does not necessarily represent a physical card or a circuit pack in the system (i.e., there could be switches that divide the functionality of a “line card” into multiple cards and daughter cards). Further, there could be switches that capture a switch fabric along with multiple line cards onto one physical card, or switches that do not have a switch fabric at all (e.g., one line card with multiple ports or multiple line cards with a full mesh connection on the backplane). All of these types of partitioning (and beyond) would be apparent to a person of ordinary skill in the art, and the disclosed synthesis methods apply regardless of the physical system partition.
As discussed hereinafter, the synthesis techniques of the present invention can be applied to designing the line cards 180, which provide network interfaces, make forwarding and scheduling decisions, and, most critically, modify packets according to their contents. According to various aspects of the invention, a technique is provided for modeling modules or algorithms that transform input packets into output packets, and a synthesis procedure is also provided that translates the packet transformation algorithms into efficient synthesizable Hardware Description Language (HDL) code, such as Very High Speed Integrated Circuit (VHSIC) HDL (VHDL) or Verilog. As used herein, a packet “transformation” includes inserting a field, removing a field, changing the contents of a field, and maintaining the existing contents of a field (a “no operation”). The disclosed techniques produce standalone packet editing blocks that are easily connected in pipelines. The modeling techniques of the present invention are easier to write and maintain than traditional RTL descriptions. Maintainability of already-deployed systems is becoming increasingly important with the widespread use of FPGAs, which allow hardware components of a system to be updated in the field in the same manner as its software.
Circuits for Packet Processing and Packet Editing
The disclosed synthesis techniques build components in the ingress (input) and egress (output) circuits for packet processing 110, 150, such as the exemplary flow discussed below.
Circuits for packet processing 110, 150 perform complex tasks but are usually designed by composing simpler functions in a graph whose topology reflects the packet flow. This model has been used for software implementations on hosts and on switches. See, e.g., S. O'Malley and L. L. Peterson, “Dynamic Network Architecture,” ACM Transactions on Computer Systems, 10(2), 110-143 (1992); or E. Kohler et al., “The Click Modular Router,” ACM Transactions on Computer Systems, 18(3), 263-297 (August 2000). Alternatively, a pool of task-specific threads may process the same packet in parallel without actually moving the packet. See, G. Brebner et al., “Hyper-Programmable Architecture for Adaptable Networked Systems,” Proc. of the 15th Int'l Conf. on Application-Specific Architectures and Processors (2004).
A restricted protocol graph model is employed that prohibits loops, so the packet flow through the processors 110, 150 is unidirectional. This restriction simplifies the implementation without introducing major limitations. For example, the loops in the IP router described by Kohler et al. only handle exceptions. This is better done by a control processor, i.e., outside the packet processing pipeline. It is noted, however, that such loops can be manually modeled, using existing techniques.
While the logical flow can fork and join, the disclosed embodiment employs linear pipelines. Multiple logic flows can be achieved, for example, by setting flags in the control header that instruct later stages to pass the packet intact. Thus, the disclosed techniques can emulate a logical flow having forks and joins using a linear pipeline. For example, consider a module A that branches to two modules 1 and 2, and then rejoins at module B, with module 1 processing a packet flow a and module 2 processing a packet flow b. A linear pipeline can be established with modules A, 1, 2 and B in series, and module 1 ignoring packet flow b and module 2 ignoring packet flow a. A non-linear pipeline would be more complicated and could only improve latency, not throughput. Finally, if the packet needs to be dropped or forwarded to the control processor, flags are set in the control header and the action is performed at the end of the pipeline. In this manner, all processing elements see all packets that enter the pipeline in the same order. Switches often do reorder packets, but this is usually done by the traffic manager 120, 140.
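By way of illustration only, the following Python sketch captures this flag-based emulation of forks and joins on a linear pipeline; the stage, flow and flag names are hypothetical and form no part of the disclosed implementation:

    def run_pipeline(packet, hdr, stages):
        """stages: a list of (flows, edit) pairs arranged in series;
        hdr: a control header with a 'flow' identifier and a 'drop'
        flag (all names illustrative)."""
        for flows, edit in stages:
            # A stage edits only the flows it handles (module 1 ignores
            # flow b, module 2 ignores flow a); all other packets pass
            # through the stage intact.
            if hdr['flow'] in flows:
                packet, hdr = edit(packet, hdr)
        # Drops (or diversion to the control processor) are deferred to
        # the end of the pipeline, so every stage sees every packet in
        # the same order.
        return None if hdr.get('drop') else packet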
Both the VLAN push 220 and MPLS push 230 modules insert additional headers after the Ethernet header, while the Time-To-Live (TTL) update 240 and Address Resolution Protocol (ARP) resolution 250 modules only modify existing packet fields. The VLAN pop module 210 removes a header from the packet. While this pipeline is fairly simple, a realistic switch merely performs more of these operations, not more complicated ones.
Thus, packet processing amounts to adding, removing, and modifying fields. Even the flow classification stage, which typically involves a complex search operation, ultimately just produces a modified header (i.e., a control header with a Flow ID field). These operations are referred to as “packet editing,” which is the fundamental building block of a circuit for packet processing.
In addition to the main pipeline, a memory lookup block 260 is provided for modules that must consult externally stored data.
As used herein, the term “memory lookup” shall include, for example, the presentation of an address to the memory with the memory returning the content of that location, as well as the presentation of a search key to the memory with the memory searching through some internal data structure to retrieve the data associated with the presented key (often referred to as a Content Addressable Memory (CAM) or an Associative Memory). In addition, a “memory lookup” can include, for example, a module telling the external block (the “memory”) that it saw a packet belonging to a certain flow (i.e., presenting a binary number that uniquely identifies the flow). Rather than storing some simple data associated with that address, such a memory could store information about the arrival history of the flow, and the data returned could indicate whether or not packets belonging to the flow are arriving too frequently. This type of “memory,” along with a module synthesized using the disclosed method, can then be used to implement a traffic policing function.
Modules that use memory lookup rely on a previous pipeline stage to issue the search request, which will be processed in parallel with later pipeline stages to hide memory latency. The disclosed synthesis techniques do not synthesize memory lookup blocks, but can generate search requests and consume search results.
Hence, circuits for packet processing are modeled as a linear pipeline of processing elements with four types of interfaces: (i) input from the previous pipeline stage, (ii) output to the next stage, (iii) search requests to memory, and (iv) search results from memory. The processing element must be capable of generating the search request and editing the packet based on the packet content and the data structure retrieved from memory.
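By way of a behavioral sketch (in Python, with hypothetical names), such a processing element and its four interfaces may be modeled as follows, with each list standing in for a handshaking FIFO in hardware:

    from dataclasses import dataclass, field

    @dataclass
    class ProcessingElement:
        from_prev: list = field(default_factory=list)   # (i) words from the previous stage
        to_next: list = field(default_factory=list)     # (ii) words to the next stage
        search_req: list = field(default_factory=list)  # (iii) search requests to memory
        search_res: list = field(default_factory=list)  # (iv) search results from memory

        def process(self, edit):
            # Edit the packet based on its content and the data structure
            # retrieved from memory; issue any search request for a later
            # stage to consume, hiding the memory latency.
            self.to_next, req = edit(self.from_prev, self.search_res)
            if req is not None:
                self.search_req.append(req)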
A control finite state machine (FSM) is synthesized during step 350. The end of each path is merged during step 360 into a common REP state that is accompanied by an auxiliary align register. Finally, computational nodes are translated during step 370 directly into combinational logic to form the datapath. Each of these steps is discussed in further detail below.
Packet Editing Graph
While the behavior of a single node in a packet editing pipeline could be modeled, for example, at the register-transfer level, doing so would be awkward for these deeply pipelined circuits that must operate on many bits in parallel. Instead, the present invention employs a Packet Editing Graph (PEG) as an abstract model for describing such nodes. This type of model is easier to design and modify (because it hides implementation details), and it can be synthesized into very efficient circuitry.
A PEG is an acyclic, directed graph 400 consisting of four classes of elements:
inputs 410, 415 (the packet 410 itself and data 415 from the memory lookup block 260), drawn as rectangles; arithmetic and logical operators; outputs; and the packet map 450, discussed below.
The packet map 450 (i.e., the control-flow graph on the right) is an important aspect of a PEG. The bits of the output packet are assembled by starting at the top of this graph 450 and traversing it downward. Diamond-shaped nodes, such as node 460, are conditionals. Output packets are generated by proceeding down the left or right branch of each conditional node based on the value of the predicate fed to that conditional. In this manner, bits can be inserted into and deleted from the output packet. The final node 480, marked with dots, copies the remainder of the input packet to the output.
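The editing semantics of the packet map can be illustrated with the following behavioral Python sketch, in which the node classes Emit, Cond and Rest are hypothetical stand-ins for the output nodes, the conditional nodes and the final node, respectively:

    class Emit:                       # output node: contributes bytes
        def __init__(self, data, nxt):
            self.data, self.nxt = data, nxt

    class Cond:                       # conditional node: two output branches
        def __init__(self, pred, true_br, false_br):
            self.pred, self.true_br, self.false_br = pred, true_br, false_br

    class Rest:                       # final node: copies the rest of the input
        def __init__(self, offset):
            self.offset = offset      # first input byte still to be copied

    def edit(packet: bytes, node) -> bytes:
        # Every path is assumed to end at a Rest node.
        out = bytearray()
        while not isinstance(node, Rest):
            if isinstance(node, Emit):        # insert or rewrite a field
                out += node.data
                node = node.nxt
            else:                             # select a branch on the predicate
                node = node.true_br if node.pred(packet) else node.false_br
        out += packet[node.offset:]           # copy the remainder unchanged
        return bytes(out)

For example, given a frame pkt, edit(pkt, Emit(pkt[:12], Emit(tag, Rest(12)))) would model a VLAN push: the two six-byte addresses are copied, a four-byte tag is inserted, and the remainder of the frame is copied unchanged.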
Synthesis Procedure
A significant challenge in synthesizing a circuit from a PEG 400 is converting the flat, bit-level PEG specification into the sequential word-level implementation needed for performance. This can be non-trivial because operand and result bit fields are generally not on word boundaries, and some results may depend on operands that appear later in the input packet. Moreover, a PEG allows conditional insertions and removals, so there is not always a simple mapping between the word in which an input byte appears and the word in which it appears in the output.
The disclosed synthesis procedure analyzes the PEG 400, establishes the necessary mapping, and builds both a datapath and a controller that produces the required behavior.
A. Wrappers and the Module Interface
The present invention creates synthesizable RTL for an element by instantiating a hand-written wrapper around the core synthesized from a PEG 400. The wrapper adapts the simple core interface to the particular protocol used between blocks and buffers.
Cores, such as the core 520, receive and send packets over a w-byte parallel interface (w equal to 8 or 16 is typical with existing technologies). The module 520 sees the input packet as a sequence of w-byte words arriving sequentially on its idata port.
For modules 520 having auxiliary inputs, such as memory reads, the synthesized core 520 assumes that the input data is present and stable at that input for the entire duration of the currently processed packet. Thus, for the core 520, an auxiliary input is seen as a constant parameter (i.e., one value per packet).
It is thus the duty of the wrapper 500 to perform the following:
a. stall the core if the auxiliary input data has not arrived yet; and
b. switch the input value from the current value to the next value exactly when the core 520 switches from processing the current packet to the next packet.
For modules having auxiliary outputs, such as memory writes, the core 520 writes to the auxiliary output as soon as the data to be written is computed. The core 520 assumes that it is possible to write the data. Thus, it is the duty of the wrapper 500 to stall the core 520 if the receiver cannot accept the data (i.e., the module is back-pressured).
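These duties can be summarized by the following sketch of the wrapper's stall decision (the signal names are illustrative):

    def wrapper_suspend(at_packet_start: bool, aux_valid: bool,
                        sink_ready: bool) -> bool:
        # Stall if the auxiliary (e.g., memory) data for the packet the
        # core is about to process has not arrived yet...
        if at_packet_start and not aux_valid:
            return True
        # ...or if the downstream receiver is back-pressuring and cannot
        # accept the packet or auxiliary data the core wants to write.
        if not sink_ready:
            return True
        return False

    # The wrapper also advances the auxiliary input register exactly on
    # the cycle the core moves from the current packet to the next one,
    # so the core always sees one stable value per packet.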
B. Splitting Data into Words
The packet map is restructured so that conditions are only checked at the beginning of each word. This guarantees that only complete words are generated in each cycle except the last (a special case).
The restructuring procedure 600 has the potential to generate an exponentially large tree, but in practice this is not a problem because protocols are designed to avoid it; furthermore, the process 600 reconverges whenever possible.
Reconvergence is handled by maintaining a cache of nodes that can be reused safely. If a node visit is “clean,” that is, the pending vector is empty, the cache is checked for a previous visit and the earlier result is reused if possible.
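A simplified Python sketch of the restructuring procedure, reusing the hypothetical Emit, Cond and Rest node classes from the earlier sketch and writing W for the word width in bytes, is as follows:

    W = 8          # word width in bytes
    cache = {}     # memoization table for clean (word-aligned) visits

    def split(node, pending=b""):
        # "pending" collects the bytes emitted since the last word boundary.
        clean = (len(pending) == 0)
        if clean and id(node) in cache:
            return cache[id(node)]            # reconverge: reuse the subtree
        if isinstance(node, Emit):
            data = pending + node.data
            full = len(data) // W * W         # bytes forming complete words
            result = Emit(data[:full], split(node.nxt, data[full:]))
        elif isinstance(node, Cond):
            # Duplicate the partial word into both branches so that the
            # condition is only tested at a word boundary.
            result = Cond(node.pred, split(node.true_br, pending),
                          split(node.false_br, pending))
        else:                                 # Rest: flush the partial word;
            result = Emit(pending, node)      # its alignment is fixed later
        if clean:
            cache[id(node)] = result
        return result

Because only one branch of a conditional executes at run time, duplicating the pending bytes into both branches does not duplicate output data; it merely ensures that every condition is evaluated on a word boundary.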
C. Assigning Read Cycle Indices
After splitting the packet map into word-size chunks using the restructuring procedure 600, a read cycle index is assigned to each node, indicating the input word that must have arrived before the node's value is available.
The first input word has index zero, the second index one, and so forth. The remaining indices are computed by observing the obvious causality relationship: the index of a node is the highest index among all of its predecessors. Constant nodes and memory inputs, assumed to be present in all cycles, are therefore ignored.
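A sketch of this computation over a topologically ordered node list (with hypothetical node records) follows:

    def assign_indices(nodes):
        """nodes: topologically ordered; each node is a dict with a
        'kind' in {'input', 'const', 'mem', 'op'}, a 'preds' list and,
        for inputs, the arrival 'word' number (names illustrative)."""
        index = {}
        for n in nodes:
            if n['kind'] == 'input':
                index[id(n)] = n['word']       # word 0 in cycle 0, word 1 in cycle 1, ...
            elif n['kind'] in ('const', 'mem'):
                index[id(n)] = None            # available in every cycle: ignored
            else:
                preds = [index[id(p)] for p in n['preds']
                         if index[id(p)] is not None]
                # Causality: a node is ready when its latest operand is.
                index[id(n)] = max(preds, default=None)
        return index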
D. Scheduling
Once the read cycle indices are assigned in the manner described above, bubbles are inserted on the arcs of the packet map according to the following two rules:
1. If two indices differ by k>0, at least k bubbles are needed between them.
2. Any two output nodes in the packet map, even with the same index, require at least one bubble between them.
To comply with the first rule, exactly k bubbles are inserted on any arc between nodes with different indices. It is harder to comply with the second rule.
Following the second rule, a bubble may be inserted in either of two positions.
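Under the simplifying assumption that both endpoints of an arc are output nodes, the bubble count on an arc reduces to the following sketch; the choice between the two legal insertion positions around a conditional is not captured here:

    def bubbles(src_index: int, dst_index: int) -> int:
        # Rule 1: nodes whose indices differ by k need exactly k bubbles;
        # rule 2: two output nodes need at least one bubble between them,
        # even when their indices coincide.
        k = dst_index - src_index    # k >= 0 by the causality of indices
        return max(k, 1)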
E. Synthesizing the Controller
Once the read cycle indices and bubbles have been added in the manner discussed above, the next step is to synthesize the control finite state machine (FSM). The structure of the control finite state machine follows that of the packet map. Bubbles along arcs in the packet map correspond to states; replacing them with registers leads to a one-hot encoding. The topmost bubble is the initial state. Bubbles adjacent to the leaves are special states that repeat copying data until ieop is detected, after which the FSM goes to the initial state.
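The following behavioral sketch (with hypothetical state and signal names) illustrates the resulting one-hot machine, where each bubble contributes one flip-flop to the state vector:

    def next_state(bits, arcs, rep, init, ieop):
        """bits: dict mapping each state (bubble) to its flip-flop value,
        with exactly one True; arcs: (src, dst, cond) transitions, cond
        already evaluated for this cycle; rep/init: the REP and initial
        states; ieop: the input end-of-packet flag."""
        nxt = dict.fromkeys(bits, False)
        for src, dst, cond in arcs:           # ordinary bubble-to-bubble moves
            if bits[src] and cond:
                nxt[dst] = True
        if bits[rep]:                         # REP repeats copying until ieop,
            nxt[init if ieop else rep] = True # then returns to the initial state
        return nxt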
F. Handling the End of a Packet
All paths in the packet map reconverge to no more than w different states in the end, corresponding to alignments ranging from no shifting necessary to shifting w−1 bytes. However, rather than leaving these states separate, the disclosed algorithm merges the end of each path into a common REP state that is accompanied by an auxiliary align register of log2(w) bits. The align register is loaded on any transition that leads to REP.
Using the align register, the REP state performs two tasks. First, it aligns the data using a multiplexer.
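Behaviorally, the alignment performed in the REP state can be sketched as follows:

    def rep_word(prev_word: bytes, cur_word: bytes, align: int) -> bytes:
        """One output word in the REP state: the last `align` bytes of the
        previous input word followed by the first w - align bytes of the
        current one; align == 0 passes the input straight through."""
        w = len(cur_word)
        return (prev_word + cur_word)[w - align : 2 * w - align]

For w equal to 8 and align equal to 3, each output word is thus the last three bytes of the previous input word concatenated with the first five bytes of the current one.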
G. Synthesizing the Data Path
The computational nodes are translated directly into combinational logic; they form the datapath. Furthermore, bubbles are translated into registers to guarantee that any node with read cycle index i has a valid value on that respective cycle.
Since a read cycle index may correspond to several clock cycles, registers must hold their values in such cases. The exemplary embodiment employs a simple approach where all registers hold their value when the present and next states are equal and are loaded otherwise. A more efficient scheme is possible by noticing that, for a register driving a node with index i, the output is a “don't care” unless the FSM's next state also has index i.
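This simple policy amounts to the following next-value function for every datapath register (a sketch only; the refined scheme would instead gate each register on the index of the node it drives):

    def register_next(q, d, present_state, next_state):
        # Hold when the FSM stays in the same state (a read cycle index
        # spanning several clock cycles); otherwise load the new value.
        return q if present_state == next_state else d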
The datapath is pipelined by adding an arbitrary number of registers (usually 1-3) at the module outputs (e.g., odata or owr). These registers will likely be retimed backward into the combinational logic by the RTL synthesis tool, because they do not belong to critical sequential cycles, thus improving performance.
When the wrapper 500 asserts the suspend signal, the core module 520 holds all the registers, both in the controller FSM 1100 and in the data path.
System and Article of Manufacture Details
As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer readable medium having computer readable code means embodied thereon. The computer readable program code means is operable, in conjunction with a computer system, to carry out all or some of the steps to perform the methods or create the apparatuses discussed herein. The computer readable medium may be a recordable medium (e.g., floppy disks, hard drives, compact disks, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic media or height variations on the surface of a compact disk.
The computer systems and servers described herein each contain a memory that will configure associated processors to implement the methods, steps, and functions disclosed herein. The memories could be distributed or local and the processors could be distributed or singular. The memories could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by an associated processor. With this definition, information on a network is still within a memory because the associated processor can retrieve the information from the network.
It is to be understood that the embodiments and variations shown and described herein are merely illustrative of the principles of this invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.