The present invention relates to telecommunications networks, especially packet switched telecommunications networks and particularly to network elements and communication modules therefor, and methods of operating the same for processing packets, e.g. at nodes of the network.
Processing packets that arrive at a high rate at, for instance, a node of a telecommunications network, in a deterministic and flexible way preferably requires an architecture that takes into account the particularities of packet handling while still using flexible processing elements such as processor cores. Characteristic properties of packet processing are inherent parallelism in processing packets, high I/O (input/output) requirements in both the data plane and the control plane (on which a single processing thread can stall), and extremely small cycle budgets which need to be used as efficiently as possible. Parallel processing is therefore advantageous for packet processing in high-throughput packet-switched telecommunications networks in order to increase processing power.
Although processing may be carried on in parallel, certain resources which need to be accessed are not duplicated. This results in more than one processing element wishing to access such a resource. A shared resource, e.g. a database, is one which is accessible by a plurality of processing elements. Each processing element can be carrying out an individual task which can be different from tasks carried out by any other processing element. As part of the task, access to a shared resource may be necessary, e.g. to a database to obtain relevant in-line data. When throughput is to be maximized, the processing elements' accesses to shared resources generally have a large latency. If a processing element is halted until the reply from the shared resource is received, the efficiency is low. Also, resources requiring large storage space are normally located off-chip, so that access and retrieval times are significant.
Conventionally, optimizing processing on a processing element having, for example, a processing core involves context switching, that is, one processing thread is halted and all current data stored in registers are saved to memory in such a way that the same context can be recreated at a later time, when the reply from the shared resource is received. However, context switching either takes up a large amount of processor resources or, if only a small amount of processor resources is allocated to this task, a large amount of time.
It is an object of the present invention to provide a packet processing element and a method of operating the same with improved efficiency.
It is a further object of the present invention to provide a packet processing element and a method of operating the same with which context switching involves a low overhead on processing time and/or low allocation of processing resources.
It is a further object of the present invention to provide an efficient packet processing element and a method of operating the same using parallel processing.
The present invention solves these problems and achieves a very high efficiency while keeping a simple programming model, without requiring expensive multi-threading on the processing elements and with the possibility to tailor processing elements to a particular function. The present invention relies in part on the fact that, with respect to context switching, typically there is little useful context, or useful context can be reduced to a minimum by judicious task programming, when a shared resource request is launched in a network element of a packet switched telecommunications network. Switching to process another packet does not necessarily require saving the complete state of a processing element. The judicious programming can include organizing the program to be run on each processing element as a sequence of function calls, each call having a context when run on a processing element but requiring no interfunction calls; the only context carried between functions is the data in the packet itself.
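By way of illustration only, the following C sketch shows one possible shape of such judicious task programming: each task is an ordinary function that communicates with the following task solely through fields of the packet head, so suspending work on a packet amounts to remembering nothing beyond the head itself. The structure layout, task names and table are hypothetical and are not taken from the embodiments described below.

```c
#include <stdint.h>

#define MAX_SCRATCH 16

struct head {
    uint16_t length;               /* packet length                        */
    uint16_t next_task;            /* index of the next function to run    */
    uint8_t  scratch[MAX_SCRATCH]; /* in-line intermediate results         */
    uint8_t  data[64];             /* first data portion of the packet     */
};

typedef void (*task_fn)(struct head *h);

static void classify(struct head *h) { h->next_task = 1; }
static void lookup(struct head *h)   { h->next_task = 2; }
static void forward(struct head *h)  { h->next_task = 0; }

/* The task table: no function calls another, so halting after any task
 * and resuming later needs no saved processor state; the whole context
 * travels in the head. */
static task_fn task_table[] = { classify, lookup, forward };

void run_next_task(struct head *h)
{
    task_table[h->next_task](h);
}
```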
Accordingly, the present invention provides a method of processing data packets in a packet processing apparatus for use in a packet switched network, the packet processing apparatus comprising a plurality of parallel pipelines, each pipeline comprising at least one processing unit for processing a part of a data packet, the method comprising: organizing the tasks performed by each processing unit into a plurality of functions such that there are substantially only function calls and no interfunction calls, and such that at the termination of each function called by the function call for one processing unit, the only context is a first data portion.
The present invention provides a packet processing apparatus for use in a packet switched network, comprising: means for receiving a packet in the packet processing apparatus; means for adding to at least a first data portion of the packet administrative information including at least an indication of at least one process to be applied to the first data portion; a plurality of parallel pipelines, each pipeline comprising at least one processing unit, and the at least one processing unit carrying out the at least one process on the first data portion indicated by the administrative information to provide a modified first data portion.
The present invention also provides a communications module for use in a packet processing apparatus, comprising: means for receiving a packet in the communication module; means for adding to at least a first data portion of the packet administrative information including at least an indication of at least one process to be applied to the first data portion; a plurality of parallel communication pipelines, each communication pipeline being for use with at least one processing unit, and a memory device for storing the first data portion.
The present invention also provides a method of processing data packets in a packet processing apparatus for use in a packet switched network, the packet processing apparatus comprising a plurality of parallel pipelines, each pipeline comprising at least one processing unit, the method comprising: adding to at least a first data portion of the packet administrative information including at least an indication of at least one process to be applied to the first data portion; and the at least one processing unit carrying out the at least one process on the first data portion indicated by the administrative information to provide a modified first data portion.
The present invention also provides a packet processing apparatus for use in a packet switched network, comprising: means for receiving a packet in the packet processing apparatus; a module for splitting each packet received by the packet processing apparatus into a first data portion and a second data portion; means for processing at least the first data portion; and means for reassembling the first and second data portions.
The present invention also provides a method of processing data packets in a packet processing apparatus for use in a packet switched network, comprising splitting each packet received by the packet processing apparatus into a first data portion and a second data portion; processing at least the first data portion; and reassembling the first and second data portions.
The present invention also provides a packet processing apparatus for use in a packet switched network, comprising: means for receiving a packet in the packet processing apparatus; a plurality of parallel pipelines, each pipeline comprising at least one processing element, a communication engine linked to the at least one processing element by a two port memory unit, one port being connected to the communication engine and the other port being connected to the processing element.
The present invention also provides a communications module for use in a packet processing apparatus, comprising: means for receiving a packet in the communications module; a plurality of parallel communication pipelines, each communication pipeline comprising at least one communication engine for communication with a processing element for processing packets and a two port memory unit, one port of which being connected to the communication engine.
The present invention also provides a packet processing unit for use in a packet switched network, comprising: means for receiving a data packet in the packet processing unit; a plurality of parallel pipelines, each pipeline comprising at least one processing element for carrying out a process on at least a portion of a data packet, a communication engine connected to the processing element, and at least one shared resource, wherein the communication engine is adapted to receive a request for a shared resource from the processing element and transmit it to the shared resource. The communication engine is also adapted to receive a reply from the shared resource(s).
The present invention also provides a communication module for use with a packet processing unit, comprising: means for receiving a data packet in the communication module; a plurality of parallel pipelines, each pipeline comprising at least a communication engine having means for connection to a processing element, and at least one shared resource, wherein the communication engine is adapted to receive a request for a shared resource and transmit it to the shared resource and for receiving a reply from the shared resource and to transmit it to the means for connection to the processing element.
The present invention will now be described with the help of the following drawings.
FIGS. 1a and 1b show a packet processing path in accordance with an embodiment of the present invention.
FIGS. a and b show dispatch operations on a packet in accordance with an embodiment of the present invention.
FIG. a shows the location of heads in a FIFO memory associated with a processing unit in accordance with an embodiment of the present invention.
FIG. b shows a head in accordance with an embodiment of the present invention.
The present invention will be described with reference to certain embodiments and drawings but the present invention is not limited thereto. The skilled person will appreciate that the present invention has wide application in the field of parallel processing and/or in packet processing in telecommunications networks, especially packet switched telecommunication networks.
One aspect of the present invention is a packet processing communication module which can be used in a packet processing apparatus for packet header processing. The packet processing apparatus consists of a number of processing pipelines, each consisting of a number of processing units. The processing units include processor elements, e.g. processors and associated memory. The processors may be microprocessors or may be programmable digital logic elements such as Programmable Array Logic (PAL), Programmable Logic Arrays (PLA), Programmable Gate Arrays, especially Field Programmable Gate Arrays. The packet processing communication module comprises pipelined communication engines which provide non-local communication facilities suitable for processing units. To complete a packet processing apparatus, processor cores and optionally other processing blocks are installed on the packet processing communication module. The processor cores do not need to have a built-in local hardware context switching facility.
In the following, the present invention will be described mainly with respect to the completed packet processing apparatus; however, it should be understood that the type and size of the processor cores used with a packet processing communication module in accordance with the present invention are not necessarily a limitation on the present invention, and that the communications module (without processors) is also an independent aspect of the present invention.
One aspect of the present invention is an optimized software/hardware partitioning. For example, the processing elements are preferably combined with a hardware block called the communication engine, which is responsible for non-local communication. This hardware block may be implemented in a conventional way, e.g. as a logic array such as a gate array. However, the present invention may be implemented by alternative arrangements, e.g. the communication engine may be implemented as a configurable block such as can be obtained by the use of programmable digital logic elements such as Programmable Array Logic (PAL), Programmable Logic Arrays (PLA), Programmable Gate Arrays, especially Field Programmable Gate Arrays. In particular, in order to bring a product to market as soon as possible, the present invention includes an intelligent design strategy over two or more generations, whereby in a first generation programmable devices are used which are replaced in later generations by dedicated hardware blocks.
Hardware blocks are preferably used for protocol independent functions. For protocol dependent functions it is preferred to use software blocks which allow reconfiguration and reprogramming if the protocol is changed. For example, a microprocessor may find advantageous use for such applications.
A completed packet processing apparatus 10 according to an embodiment of the present invention comprises a packet processing communication module with installed processors. The processing apparatus 10 has a packet processing path as shown in FIGS. 1a and 1b.
As shown schematically in FIGS. 1a and 1b, the packet processing path comprises a packet dispatcher 2 with a packet splitting and sequence number assigning module 15, a plurality of parallel processing pipelines 4-6 with ingress packet buffers 4a, 5a, 6a and egress packet buffers 4f, 5f, 6f, a tail memory such as a FIFO 9, and a packet reassembly module 3.
Typically, one or more shared resources SR14 are available to the processing path, which handle specific tasks for the processing units in a pipeline. For example, these shared resources can be dedicated lookup engines using data structures stored in off-chip resources, or dedicated hardware for specialized functions which need to access shared information. The present invention is particularly advantageous in increasing efficiency when these shared resource engines which are to be used in a processing system respond to requests with a considerable latency, that is a latency such as to degrade the efficiency of the processing units of the pipeline if each processing unit is halted until the relevant shared resource responds. Typical shared resources which can be used with the present invention are an IP forwarding table, an MPLS forwarding table, a policing database and a statistics database. For example, the functions performed by the pipeline structure assisted by shared resources may include IP and MPLS forwarding lookups, policing and the collection of statistics.
One aspect of the use of shared resources is the stall time of processing units while waiting for answers to requests sent to shared resources. In order for a processing unit to abandon one currently pending task, change to another and then return to the first, it is conventional to provide context switching, that is, to store the contents of the registers of the processor element. An aspect of the present invention is the use of hardware accelerated context switching. This also allows a processor core which is not provided with its own hardware switching facility to be used for the processing element. This hardware is preferably provided in each processing node, e.g. in the form of a communication engine. Each processing unit maintains a pool of packets to be processed. When a request to a shared resource is issued, a processing element of the relevant processing unit switches context to another packet until the answer to the request has arrived. One aspect of the present invention is to exploit packet processing parallelism in such a way that the processing units can be used as efficiently as possible doing useful processing, thus avoiding waiting for I/O (input/output) operations to complete. These I/O operations are, for example, requests to shared resources or copying packet information into and out of the processing element. The present invention relies in part on the fact that typically there is little useful context, or useful context can be reduced to a minimum by judicious task programming, when a shared resource request is launched in a network element of a packet switched telecommunications network. Switching to process another packet does not necessarily require saving the complete state of a processing element. The judicious programming can include organizing the program to be run on each processing element as a sequence of function calls, each call having a context when run on a processing element but requiring no interfunction calls; the only context carried between functions is provided by the data in the packet itself or in a part of the packet.
Returning to FIGS. 1a and 1b, each packet received by the packet processing apparatus is split by the packet splitting and sequence number assigning module 15 into a first data portion (the head) and a second data portion (the tail).
After splitting, the head is then supplied to one of the processing pipelines, while the tail is stored into a memory such as a FIFO 9. Each packet is preferably assigned a sequence number by the sequence number assigning module 15. This sequence number is copied into the head as well as into the tail of each packet and stored. It may be used for three purposes, such as restoring packet order at the end of the pipelines and matching each head with its corresponding tail at reassembly, as described below.
The sequence number can be generated, for example, by a counter included in the packet splitting and sequence number assigning means 15. The counter increments with each incoming packet. In that way, the sequence number can be used to put packets in a specific order at the end of the pipelines.
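For illustration, a minimal C sketch of this dispatch step follows, under the assumption of a fixed head size and illustrative field names; the actual split point and formats are implementation choices not fixed by the text.

```c
#include <stdint.h>
#include <string.h>

#define HEAD_BYTES 64  /* assumed fixed head size */

struct head { uint32_t seq; uint16_t len; uint8_t data[HEAD_BYTES]; };
struct tail { uint32_t seq; uint16_t len; const uint8_t *data; };

static uint32_t seq_counter;  /* increments with each incoming packet */

void dispatch(const uint8_t *pkt, uint16_t len, struct head *h, struct tail *t)
{
    uint16_t hlen = len < HEAD_BYTES ? len : HEAD_BYTES;
    uint32_t seq  = seq_counter++;

    h->seq = seq;              /* same number in head and tail, so they */
    t->seq = seq;              /* can be re-matched on egress           */
    h->len = hlen;
    memcpy(h->data, pkt, hlen);

    t->len  = len - hlen;      /* tail is stored, e.g. in the FIFO 9    */
    t->data = pkt + hlen;
}
```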
An overhead generator, provided in the packet dispatcher 2 or, more preferably, in the packet buffer 4a, 5a, 6a, generates new/additional overhead for each head and/or tail. After the complete head has been generated, the head is sent to one of the pipelines 4-6 that has buffer space available. The tail is sent to the tail FIFO 9.
In accordance with an embodiment of the present invention, the added overhead includes administrative data in the head and/or the tail. A process flow is shown schematically in the accompanying drawings.
FIG. b shows an alternative set of actions performed on a packet within the processing apparatus. Each head processed by the pipeline may be preceded by a scratch area which can be used to store intermediate results. It may also be used to build a packet descriptor which can be used by processing devices downstream of the packet processing unit. The packet buffer 4a, 5a, 6a at the beginning of each pipeline can add this scratch area to the packet head. The packet buffer 4f, 5f, 6f at the end removes it (at least partially), as shown in the drawings.
It is one aspect of the present invention that the head when it is in the pipeline includes a reference to a task to be performed by the current and/or the next processing unit. In this way a part of the context of a processor element is stored in the head. That is, the current version of the HAF in a head is equivalent to the status of the processing including an indication of the next process to be performed on that head. The head itself may also store in-line data, for example intermediate values of a variable can be stored in the scratch area. All information that is necessary to provide a processing unit with its context is therefore stored in the head. When the head is moved down the pipeline, the context moves with the head in the form of the data stored in the relevant parts of the head, e.g. HAF, scratch area. Thus, a novel aspect of the present invention is that the context moves with the packet rather than the context being static with respect to a certain processor.
The packet reassembly module 3 reassembles the packet heads coming from the processing pipelines 4-6 and the corresponding tails coming from the tail FIFO 9. Packet networks may be divided into those in which each packet can be routed independently at each node (datagram networks) and those in which virtual circuits are set up and packets between a source and a destination use one of these virtual circuits. Thus, depending upon the network there may be differing requirements on packet sequencing. The reassembly module 3 assures packets leave in the order they arrive or, alternatively, in any other order as required. The packet reassembly module 3 has means for keeping track of the sequence number of the last packet sent. It searches the outputs of the different processing pipelines for the head having a sequence number which may be sent, as well as the end of the FIFO 9 to see which tail is available for transmission, e.g. the next sequence number. For simplicity of operation it is preferred if the packets are processed in the pipelines strictly in accordance with sequence number so that the heads and their corresponding tails are available at the reassembly module 3 at the same time. Therefore, it is preferred if means for processing packets in the pipelines strictly in accordance with sequence number are provided. Then, after the appropriate head is propagated to the output of the pipeline, it is added in the reassembly module 3 to the corresponding tail, which is preferably the first entry in the tail FIFO 9 at that moment. The reassembly unit 3 or the egress packet buffer 4f, 5f, 6f removes the remaining HAF and other fields from the head.
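The reordering described above might be sketched as follows; the polling loop, the names and the single next_seq counter are illustrative assumptions rather than the actual reassembly hardware. The caller pairs the selected head with the tail at the front of the tail FIFO, or drops both when the Drop flag is set.

```c
#include <stdint.h>
#include <stddef.h>

struct head { uint32_t seq; int drop; };

static uint32_t next_seq;  /* sequence number expected on egress */

/* heads[i] is the head at the output of pipeline i, or NULL if none. */
struct head *select_next_head(struct head *heads[], size_t n_pipes)
{
    for (size_t i = 0; i < n_pipes; i++) {
        if (heads[i] && heads[i]->seq == next_seq) {
            next_seq++;
            return heads[i];  /* pair with the first tail in the FIFO */
        }
    }
    return NULL;  /* the expected head has not reached an output yet */
}
```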
When a packet must be dropped, a processing unit has a means for setting an indication in the head that a head is to be dropped, e.g. it can set a Drop flag in the packet overhead. The reassembly module 3 is then responsible for dropping this head and the corresponding tail.
One pipeline 4 in accordance with an embodiment of the present invention is shown schematically in the accompanying drawings.
The communication engines communicate with each other for transferring heads. Thus, when each communication engine is ready to receive new data, a ready signal is sent to the previous communication engine or other previous circuit element.
In accordance with an embodiment of the present invention, as shown schematically in the drawings, administrative information is added to each head in the form of a HAF and, optionally, a scratch area.
The HAF contains packet information (length) and the processing status, as well as part of the “layer 2” information, if present (at least, for instance, a code indicating the physical interface type and a “layer 3” protocol number).
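For orientation, the HAF fields named at various points in this text (length, Offset, Drop, Transfer, SRRequest, the five success bits, the task pointer, the tunneling field, the physical interface type and the layer 3 protocol number) can be collected into a single illustrative C declaration; the field widths and their ordering are assumptions, not the actual encoding.

```c
#include <stdint.h>

struct haf {
    uint32_t length     : 14; /* packet length, updated when bytes strip */
    uint32_t offset     : 8;  /* first relevant byte, zero on ingress    */
    uint32_t drop       : 1;  /* set to discard head and matching tail   */
    uint32_t transfer   : 1;  /* head is ready for the next stage        */
    uint32_t sr_request : 1;  /* RequestIDs pending in the buffer        */
    uint32_t success    : 5;  /* per-request success/failure bits        */
    uint32_t tunnel     : 2;  /* number of processing units to skip      */

    uint16_t task;            /* pointer/index of the next task          */
    uint8_t  phy_if_type;     /* "layer 2" physical interface type code  */
    uint8_t  l3_proto;        /* "layer 3" protocol number               */
};
```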
A communication module in accordance with an embodiment of the present invention may comprise the dispatch unit 2, the packet assembly unit 3, the memory 9, the communication engines 11b-11d, the dual port RAM 7b-7d, optionally the packet buffers, as well as suitable connection points to the processing units and to the shared resources. When the communications module is provided with its complement of processing elements a functioning packet processing apparatus is formed.
A processing unit in accordance with an embodiment of the present invention is shown schematically in the accompanying drawings.
A processing element 14b in accordance with the present invention can efficiently be implemented using a processing core such as an Xtensa® core from Tensilica, Santa Clara, Calif., USA. A processing core with dedicated hardware instructions to accelerate the functions that will be mapped on this processing element makes a good trade-off between flexibility and performance. Moreover, the needed processing element hardware support can be added to such a processor core, i.e. the processor core does not require context switching hardware support. The processing element 14b is connected to the communication engine 11b through a system bus 20b; resets and interrupts may be transmitted through a separate control bus.
In accordance with an aspect of the present invention, processing elements are synchronized in such a way that the buffers 37a-h do not overflow or underflow. Processing of a head is done in place at a processing element. Packets are removed from the system as quickly as they arrive, so processing will never create the need for extra buffer space; hence a processing element should not generate a buffer overflow. Processing a packet can only be started when enough data are available. The hardware (communication engine) suspends the processing element when no heads are eligible for processing. The RAM 7b . . . 7d provides buffer storage space and allows the processing elements to be decoupled from the processing pace of the pipeline.
Each processing element can decide to drop a packet or to strip a part of the head or add something to a head. To drop a packet, a processing element simply sets the Drop flag in the HAF. This will have two effects: the head will not be eligible anymore for processing and only the HAF will be transferred to the next stage. When the packet reassembler 3 receives a head having the Drop bit set, it drops the corresponding tail.
The HAF has an Offset field which indicates the location of the first relevant byte. On an incoming packet, this will always be equal to zero. To strip a part of the head at the beginning, the processing element makes the Offset field point to the first byte after the part to be stripped. The communication engine will remove the part to be stripped, realign the data to word boundaries, update the Length field in the HAF, and put the Offset field back to zero. This is shown in the accompanying drawings.
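In software terms the strip-and-realign step might look as follows; a real communication engine realigns word by word in hardware, and memmove is only the software analogue of that operation. The simplified two-field HAF is illustrative.

```c
#include <stdint.h>
#include <string.h>

struct haf { uint16_t length; uint16_t offset; };

void strip_head(struct haf *haf, uint8_t *buf)
{
    if (haf->offset == 0)
        return;                  /* nothing to strip                     */

    /* Remove the first 'offset' bytes and realign the remainder to the
     * start of the buffer (i.e. to a word boundary). */
    memmove(buf, buf + haf->offset, haf->length - haf->offset);

    haf->length -= haf->offset;  /* update the Length field in the HAF   */
    haf->offset  = 0;            /* put the Offset field back to zero    */
}
```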
The dispatching unit 2 can issue a Mark command by writing a non-zero value into a Mark register. This value will be assigned to the next incoming packet, i.e. placed in the head. When the reassembly unit 3 issues a command for this packet (at that moment the head is completely processed), the mark value can result in the generation of an interrupt. One use of marking a packet arises when performing table updates: it may be necessary to know when all packets received before a certain moment have left the pipelines. Such packets need to be processed with old table data; new packets are to be processed with new table data. Since packet order remains unchanged through the pipelines, this can be accomplished by marking an incoming packet. In packet processing apparatus in which the order is not maintained, a timestamp may be added to each head instead of a mark to one head. Each head is then processed according to its timestamp. This may involve storing two versions of table information for an overlap time period.
Each processing element has access to a number of shared resources, used, for example, for a variety of tasks such as lookups, policing and statistics. This access is via the communications engine associated with each processing element. A number of buses 8a-f are provided to connect the communication engines to the shared resources. The same buses 8a-f are used to transfer the requests as well as the answers. For example, each communication engine 11b is connected to such a bus 8 via a Shared Resource Bus Interface 24b (SRBI).
The communication engine 11b is preferably the only way for a processing element to communicate to resources other than its local memory 13b. The communication engine 11b is controlled by the host processing element 14b via a control interface. The main task of the communication engine 11b is to transfer packets from one pipeline stage to the next one. Besides this, it implements context switching and communication with the host processing element 14b and shared resources.
The communication engine 11b has a receive interface 22b (Rx) connected to the previous circuit element of the pipeline and a transmit interface 23b (Tx) connected to the next circuit element in the pipeline. Heads to be processed are transmitted from one processing unit to another via the communications engines and the TX and RX interfaces, 22b, 23b. If a head is not to be processed in a specific processing unit it can be provided with a tunneling field which defines the number of processing units to be skipped.
Each transmit/receive interface 22b, 23b of a communication engine 11b which is receiving and transmitting at the same time can only access the data memory 7 during less than 50% of the clock cycles. This implies that the effective bandwidth between two processing stages is less than half the bus bandwidth. As long as the number of pipelines is greater than two this is sufficient, since with n pipelines each pipeline stage only has to carry about 1/n of the full bus rate, which for n greater than two is below the half-bus-rate limit. However, the first pipeline stage has to be able to sink bursts at full bus speed when a new packet head enters the pipeline. In a similar way, the last pipeline stage must be able to produce a packet at full bus speed. The ingress packet buffer 4a, 5a, 6a is responsible for equalizing these bursts. The ingress packet buffer receives one packet head at bus speed and then sends it to the first processor stage at its own speed. During that period, it is not able to receive a new packet head. The egress packet buffer 4f, 5f, 6f receives a packet head from the last processor stage. When received, it sends the head to the packet reassembly unit 3 at bus speed. The ingress packet buffer can have two additional tasks:
The egress packet buffer has one additional task:
A number of hardware extensions are included in accordance with the present invention to help the FIFO management:
As indicated above, in an aspect of the present invention hardware, such as the communication engine, may be provided to support a very simple multitasking scheme. A “context switch” is done, for example, when a process running on a processing element has to wait for an answer from a shared resource or when a head is ready to be passed to the next stage. The hardware is responsible for selecting a head that is ready to be processed, based on the HAF. Packets are transferred from one stage to another via a simple ready/available protocol or any other suitable protocol. Only the part of a buffer that contains relevant data is transferred. To achieve this, the head is modified to contain the necessary information for directing the processing of the heads. In accordance with embodiments of the present invention, processing of a packet is split up into a number of tasks. Each task typically handles the response to a request and generates a new request. A pointer to the next task is stored in the head. Each task first calculates and then stores the pointer to the next task. Each packet has a state defined by Done and Ready bits, two bits whose combinations have the following meaning:
From a buffer management point of view, buffers containing a packet can be in three different states: Ready4Processing, Ready4Next and Waiting.
The communication engine maintains the packet state, e.g. by storing the relevant state in a register, and also provides packets in the Ready4Processing state to the processor with which it is associated. After being processed, a packet is in the Ready4Next or Waiting state. In the case of the Ready4Next state, the communication engine will transmit the packet to the next stage. When in the Waiting state, the state will automatically be changed by the communication engine to the Ready4Processing or Ready4Next state when the shared resource answer arrives.
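These transitions can be summarized in a short C sketch; the enum and function names are illustrative, not taken from the embodiments.

```c
#include <stdbool.h>

enum pkt_state { READY4PROCESSING, READY4NEXT, WAITING };

/* Called when the processor releases a packet after working on it. */
enum pkt_state after_processing(bool sr_request_pending, bool transfer)
{
    if (sr_request_pending)
        return WAITING;              /* park until the SR answer arrives */
    return transfer ? READY4NEXT : READY4PROCESSING;
}

/* Called when the shared-resource answer arrives for a Waiting packet. */
enum pkt_state on_sr_answer(bool transfer)
{
    return transfer ? READY4NEXT : READY4PROCESSING;
}
```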
The communication engine is provided to select a new packet head. The selection of a new packet head is triggered by a processing element, e.g. by a processor read on the system bus. A Current Buffer pointer is maintained in a register, indicating the current packet being processed by the processing element.
A schematic representation of a communication engine in accordance with one embodiment of the present invention is shown in the accompanying drawings.
Buffer Management:
The five functions described above have been represented as four finite state machines (FSMs 32, 33, 34a, 34b) and a buffer manager 28 in the accompanying drawings.
Main data structures (listed after most involved task) handled by the communication engine are:
Other parts of the Communication Engine are:
Each processor element 14 has access to a number of shared resources, used for lookups, policing and statistics. A number of buses 8 are provided to connect processing elements 14 to the shared resources. The same bus 8 may be used to transfer the requests as well as the answers. Each communication engine 11 is connected to such a bus via a Shared Resource Bus Interface 24 (SRBI).
Each communication engine 11 maintains a number of packet buffers 37a-h. Each buffer can contain one packet, i.e. has means for storing one packet. With respect to packet reception and transmission, the buffers are dealt with as a FIFO, so packet order remains unaltered. Packets enter from the RX Interface 22 and leave through the TX Interface 23. The number of buffers, buffer size and the start of the buffer area in the data memory 7 are configured via the control interface 26. Buffer size is always a power of 2, and the buffer start is always a multiple of the buffer size. In that way, each memory address can easily be split up into a buffer number and an offset in the buffer. A write access to a buffer by a processing element 14 is monitored by the communication engine 11 via the monitoring bus 18 and updates the buffer state in a buffer state register accordingly. A buffer manager 28 maintains four pointers in registers 35, two of them pointing to a buffer and two of them pointing to a specific word in a buffer:
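Because the buffer size is a power of two and the buffer area starts at a multiple of that size, the split of a memory address is a shift and a mask, as in the following sketch; the 256-byte buffer size is an assumed example value.

```c
#include <stdint.h>

#define BUF_SIZE_LOG2 8                 /* assumed: 256-byte buffers      */
#define BUF_SIZE (1u << BUF_SIZE_LOG2)  /* always a power of two          */

/* The buffer area starts at a multiple of BUF_SIZE, so no base address
 * subtraction is needed. */
static inline uint32_t buffer_number(uint32_t addr)
{
    return addr >> BUF_SIZE_LOG2;
}

static inline uint32_t buffer_offset(uint32_t addr)
{
    return addr & (BUF_SIZE - 1u);
}
```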
For each buffer, a state is maintained in the buffer state registers 30. Each buffer is in one of the following five states: Empty, ReadyForProcessing, ReadyForTransfer, ReadyForProcessingWSRPending or ReadyForTransferWSRPending.
Besides a state, a WaitingLevel is maintained for each buffer in the registers 35. A WaitingLevel different from zero indicates that the packet is waiting for some event, and should neither be handed over to the processor nor transmitted. Typically, the WaitingLevel represents the number of ongoing shared resource requests. After reset, all buffers are in the Empty state. When a packet is received completely, the state of the buffer where it was stored is updated to the ReadyForProcessing state for packets that need to be processed, or to the ReadyForTransfer state for packets that need no processing (e.g. dropped packets). The WaitingLevel for a buffer is set to zero on any incoming packet.
After processing a packet, the processor 14 updates the buffer state of that packet by writing the Transfer and SRRequest bits into the HAF, i.e. into the relevant buffer of the dual port RAM 7. This write is monitored by the communication engine 11 via the monitoring bus 18. The processor 14 can put a buffer in a ReadyForProcessing or ReadyForTransfer state if there are no SR requests to be sent, or in the ReadyForTransferWSRPending or ReadyForProcessingWSRPending states if there are requests to be sent. From the ReadyForTransferWSRPending or ReadyForProcessingWSRPending states, the buffer state returns to ReadyForTransfer or ReadyForProcessing as soon as all requests are transmitted. When the ReadPointer reaches the start of a new buffer, it waits until that buffer gets into the ReadyForTransfer state and has a WaitingLevel equal to zero, before reading and transmitting the packet. As soon as the transmission starts, the buffer state is set to Empty. This guarantees that the packet cannot be selected anymore. (Untransmitted data cannot be overwritten even if the buffer is in the Empty state, because the WritePointer will never pass the ReadPointer.)
As long as there are empty buffers, incoming data are accepted from the RX interface. The buffer area is full when WritePointer reaches ReadPointer (an extra flag is needed to make the distinction between full and empty, since in both conditions, ReadPointer equals WritePointer).
Packet transmission is triggered when the buffer that ReadPointer points to gets into the ReadyForTransfer state and has a WaitingLevel of zero. First, the buffer state is set to Empty. Then, the HAF and the scratch area are read from the RAM and transmitted. The words that contain only overhead to be stripped are skipped. Then the rest of the packet data is read and realigned before transmission, such that the remaining overhead bytes in the first word are removed. However, if a packet has its Drop flag set, the packet data is not read. After a packet is transmitted, ReadPointer jumps to the start of the next buffer.
The communication engine maintains the CurrentBuffer pointer, pointing to the buffer of the packet currently being processed by the processing element. An associated Valid flag indicates that the content of CurrentBuffer is valid. If the processor is not processing any packet, the Valid flag is set to false. Five different algorithms are provided to select a new buffer:
When a processor has finished processing a buffer, it specifies the next task that has to be done on the packet. This is done by writing the following fields in the packet's HAF:
The Transfer and SRRequest bits are not only written into the memory, but also monitored by the communication engine via the XLMI interface. This is used to update the buffer state:
The communication engine 11 provides a generic interface 24 to shared resources. A request consists of a header followed by a block of data sent to the shared resource. The communication engine 11 generates the header in the SRTX 34a, but the data has to be provided by the processor 14. Depending on the size and nature of the data to be sent, three ways of assembling the request can be distinguished:
The SR RequestID may contain the following fields:
After putting the RequestID's in the buffer memory 7, the processor indicates the presence of these IDs by setting the SRRequest bit in the HAF (this is typically done when the HAF is updated for the next task).
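As an illustration only, a 64-bit RequestID might be laid out as below; the text does not give the actual field list, so the fields shown (a resource selector, a request type, a success-bit selector, a reply offset and a payload) and their widths are guesses assembled from the surrounding description.

```c
#include <stdint.h>

/* Hypothetical layout of one 64-bit RequestID word. */
struct request_id {
    uint64_t resource : 8;  /* which shared resource is addressed        */
    uint64_t type     : 2;  /* e.g. direct data, indirect data, command  */
    uint64_t success  : 3;  /* which of the five HAF success bits to set */
    uint64_t offset   : 11; /* where in the buffer the reply is stored   */
    uint64_t payload  : 40; /* request data, or a pointer to it          */
};
```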
When the processor releases a buffer (by requesting a new one), the SRRequest bit in the HAF is checked. This can be done by evaluating the buffer state. If set, the buffer number of this packet is pushed into a small FIFO, the SRRequest FIFO. When this FIFO is full, the Idle task is returned on a request for a new packet, to avoid overflow. The SR TX state machine 34a (shown schematically in the accompanying drawings) assembles and transmits the shared resource requests for the buffers queued in this FIFO.
Whenever a non-command request is transmitted, the WaitLevel field is incremented by one. When a reply is received, it is decremented by one. When all requests are transmitted, the buffer state is set to ReadyForTransfer (when coming from ReadyForTransferWSRPending) or ReadyForProcessing (when coming from ReadyForProcessingWSRPending). This mechanism guarantees that a packet can only be transmitted or processed (using the Next/FirstProcessablePacket algorithm) no earlier than the moment when all its requests have been transmitted and all replies have been received.
The destination address of a reply is decoded by the shared resource bus socket. Replies that match the local address are received by the communication engine over the SRBI RX interface 24b. The reply header contains a buffer number and offset where the reply has to be stored. Based on this, the communication engine is able to calculate the absolute memory address. The data part of the reply is received from the SRBI bus 8 and stored into the data memory 7. When all data are stored, the success bits (see below) are updated by performing a read-modify-write on the HAF in the addressed buffer, and finally the WaitLevel field of that buffer is decremented by one.
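A software sketch of this reply handling, with assumed buffer dimensions and register names, might read as follows; a real engine performs the same steps in hardware on the dual port RAM.

```c
#include <stdint.h>
#include <string.h>

#define BUF_SIZE_LOG2 8
#define NUM_BUFFERS   8

static uint8_t data_memory[NUM_BUFFERS << BUF_SIZE_LOG2]; /* dual port RAM 7 */
static uint8_t wait_level[NUM_BUFFERS];    /* per-buffer WaitingLevel        */
static uint8_t success_bits[NUM_BUFFERS];  /* success bits of each HAF       */

struct reply_hdr {
    uint8_t  buffer;       /* buffer number from the reply header           */
    uint16_t offset;       /* where in that buffer the reply data belong    */
    uint8_t  success_bit;  /* which of the five success bits to update      */
    uint8_t  success;      /* 1 on success, 0 on failure                    */
};

void on_sr_reply(const struct reply_hdr *r, const uint8_t *data, uint16_t len)
{
    /* absolute address = buffer number * buffer size + offset */
    uint32_t addr = ((uint32_t)r->buffer << BUF_SIZE_LOG2) + r->offset;
    memcpy(&data_memory[addr], data, len);  /* store the reply data         */

    if (r->success)                         /* read-modify-write of the HAF */
        success_bits[r->buffer] |= 1u << r->success_bit;
    else
        success_bits[r->buffer] &= ~(1u << r->success_bit);

    wait_level[r->buffer]--;                /* one request less outstanding */
}
```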
Some of the shared resource requests can end with a success or failure status (e.g. Exact Match resource compares an address to a list of addresses. A match returns an identifier, no match returns a failure status). Means are added to propagate this to the HAF of the involved packet. A number of bits, e.g. five, are provided in the HAF which can catch the result of different requests. Therefore it is necessary that a RequestID specifies which of the five bits has to be used. Shared resources can also be put in a chain, i.e. the result of a first shared resource is the request for a second shared resource and so on. Each of these shared resources may have a success or failure status and thus may need its own success bit. It is important to note that the chain of requests is discontinued when a resource terminates with a failure status. In that case the failing resource sends its reply directly to the originating communication engine.
While processing a packet, the processing element 14 associated with a communication engine 11 can make the communication engine 11 issue one or more requests to the shared resources, by writing the necessary RequestID's into the relevant packet's buffer. Each RequestID is, for example, a single 64 bit word, and will cause one shared resource request to be generated. Replies from a shared resource are also stored in the packet's buffer. The process of assembling and transmitting the requests to shared resources is preferably started when the packet is not being processed any more by the processor. The packet can only become selectable for processing again after all replies from the shared resources have arrived. This guarantees that a single buffer will never be modified by the processor and the communication engine at the same time.
A shared resource request is invoked by sending out the request information together with information for the next action from a processing element to the associated communications engine. This is a pointer identifying the next action that needs to be performed on this packet, and an option to indicate that the packet needs to be transferred to the next processing unit for that action. Next, the processing unit reads the pointer to the action that needs to be performed next. This selection is done by the same dedicated hardware, e.g. the communication engine, which regulates the copying of heads into and out of the buffer memory 7 for the processing element relating to the processing unit. To this extent, the communication engine also processes the answers from the shared resources. A request to a shared resource preferably includes a reference to the processing element which made the request. When the answer returns from the shared resource, the answer includes this reference. This allows the receiving communication engine to write the answer into the correct location in the relevant head in its buffer. Subsequently, the processing element jumps to the identified action. In this way, the processing model is that of a single thread of execution. There is no need for an expensive context switch that needs to save all processing element states, an operation that may either be expensive in time or in hardware. Moreover, it trims down the number of options for the selection of such a processing element. The single thread of execution is in fact an endless loop of: requesting the next packet head and its associated task pointer from the communication engine; executing that task; and writing back the pointer to the following task together with the Transfer and SRRequest bits, whereby the head is released when the next packet head is requested.
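A minimal sketch of this single thread of execution, assuming a hypothetical communication engine interface (the two extern functions stand in for reads and writes over the system bus), is:

```c
#include <stdint.h>
#include <stddef.h>

struct pkt_head;  /* head stored in the dual port RAM */

/* Assumed interface: returns the next selectable head together with its
 * task pointer, or NULL (the Idle task) when no head is eligible. */
extern struct pkt_head *ce_next_packet(uint16_t *task);

/* A task ends by writing the next task pointer and the Transfer/SRRequest
 * bits into the HAF; it keeps no state in the processor between calls. */
extern void run_task(uint16_t task, struct pkt_head *h);

void processing_element_main(void)
{
    for (;;) {                      /* the endless loop                  */
        uint16_t task;
        struct pkt_head *h = ce_next_packet(&task);
        if (h == NULL)
            continue;               /* Idle: simply ask again            */
        run_task(task, h);          /* releasing the head happens on     */
    }                               /* the next ce_next_packet() call    */
}
```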
The rigid definition of this programming model allows a verification of the programming code of the actions performed on the packets on a level which does not need to include the detail of these timing and latency figures.
A further embodiment of the present invention relates to how the shared resources are accessed. Processing units and shared resources are connected via a number of busses, e.g. double 64 bit wide busses. Each node (be it a processing unit or a shared resource) has a connection to one or more of these busses. The number of busses and number of nodes connected to each bus are determined by the bandwidth requirements. Each node preferably latches the bus, to avoid long connections. This allows a high speed, but also a relative high latency bus. All nodes have the same priority and arbitration is accomplished in a distributed manner in each node. Each node can insert a packet whenever an end of packet is detected on the bus. While inserting a packet, it stalls the incoming traffic. It is assumed that this simple arbitration is sufficient when the actual bandwidth is not too close to the available bandwidth, and latency is less important. The latter is true for the packet processor, and the former can be achieved by a good choice of the bus topology.
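The per-node arbitration rule can be sketched as follows; the types are illustrative and the clearing of the node's own inserting state at its own end-of-packet is omitted for brevity.

```c
#include <stdbool.h>

struct bus_word { bool end_of_packet; /* ...payload... */ };

struct node_state {
    bool inserting;     /* node is currently sending its own packet     */
    bool want_to_send;  /* node has a request or reply queued           */
};

/* Called once per bus clock with the latched incoming word. Returns true
 * when the node forwards the incoming word, false when it sends a word
 * of its own (incoming traffic is then stalled in the latch). */
bool forward_incoming(struct node_state *n, const struct bus_word *in)
{
    if (!n->inserting && n->want_to_send && in->end_of_packet) {
        n->inserting = true;   /* seize the bus at a packet boundary    */
        return false;
    }
    return !n->inserting;      /* otherwise pass through-traffic along  */
}
```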
The shared resources may be connected to double 64-bit wide busses, as shown schematically in the accompanying drawings.
Instead of using the bus type shown in the drawings, other bus arrangements may be used.
From the above, the skilled person will appreciate that a packet entering a processing pipeline 4-6 triggers a chain of actions which are executed on that processing pipeline for that packet. An action is defined as a trace of program code (be it in hardware or in software) that is executed on a processing element during some number of clock cycles without interaction with any of the shared resources and without communication with the next processing element in the pipeline. An action ends either on a request to a shared resource, or by handing over the packet to the next stage. This sequence of actions, shared resource requests and explicit packet hand-overs to the next stage is shown schematically in the accompanying drawings.
A flow diagram of the processing of a packet by a processing unit in a pipeline is shown schematically in the accompanying drawings.
In step 100, a new packet head is presented at the receive port of a communications engine; the status of free buffers is accessed via the buffer manager and, if there is a free (empty) buffer location in the memory, the packet head is received. If a free buffer exists, the head data is sent in step 102 to the memory and stored in step 104 in the appropriate buffer, i.e. at the appropriate memory location. In step 106 the buffer state in the buffer state register is updated by the communication engine from empty to R4P if the head is to be processed (or R4T for packet heads that do not require processing, e.g. dropped and tunneled packet heads). As older packet heads in the buffers are processed and sent further down the pipeline, after some time the current R4P packet head is ready to be selected.
In step 108, the processing element finishes processing of a previous head and requests the next packet head from the communications engine. The next packet selection is decided in step 110 on the basis of the buffer states contained in the buffer state register. If no R4P packet heads are available, idle is returned by the communications engine to the processor. The processing element will then repeat the request until a non-idle answer is given.
In step 114 the communications engine accesses the next packet register and sends the next packet head location together with the associated task pointer to the processing element. In order for the processing element to get started right away, the answer provides not only the next packet head location but also the associated task pointer. This data is part of the HAF of the next packet head to be processed and hence requires the cycle(s) of a read to memory. Therefore the communication engine continuously updates, in step 112, the new packet register with a packet head location plus task pointer tuple, so as to have this HAF read take place outside the cycle budget of the processing element.
In step 116, the processing element processes the packet head and updates the HAF fields ‘Transfer’ and ‘SRRequest’. The communications engine monitors the data bus and, on the basis of this bus monitoring between the processing element and the memory, the buffer state manager is informed to update the buffer state in step 118. For instance, a head can become R4P or R4T if no SR requests are to be sent, or R4PwSRPending or R4TwSRPending if SR requests are to be sent.
In step 120 the pending SR request triggers the SR transmit machine after the processing phase to assemble and transmit the SR requests that are listed at the end of the buffer, i.e. the requestIDs list. In step 122 the request IDs are processed in sequence. The indirect type requests require reads from memory. In step 124, for every request that expects an answer back, as opposed to a command, the WaitingLevel counter is increased.
In step 126, upon receipt of an SR answer, the SR receive machine processes the result and, in step 128, writes it to the memory, more specifically to the buffer location associated with the appropriate packet head. In step 130 the WaitingLevel counter is decreased.
Eventually, when all requests are transmitted and all replies are received, a packet head is set to R4P or R4T in step 132. A first-in-first-out approach is taken for the packet head stream in the buffers. In step 134, when the oldest packet head present becomes ‘R4T’, the transmit machine outputs this packet head to the transmit port.
The processing pipelines in accordance with the present invention meet the following requirements: