The present invention relates to high speed switching of data packets in general and is more particularly concerned with a system and a method that allow multicast traffic to be handled, concurrently with unicast traffic, in a switch fabric that collapses all ingress port adapter virtual output queues (VOQ's) into its switching core while allowing efficient flow control.
The explosive demand for bandwidth over all sorts of communications networks has driven the development of very high-speed switch fabric devices. Those devices have allowed the practical implementation of network nodes capable of handling aggregate data traffic over a large range of values, i.e., with throughputs from a few gigabit (10^9) to multi-terabit (10^12) per second. To carry out switching at network nodes, today's preferred solution is to employ, irrespective of the higher communications protocols actually in use to link the end-users, fixed-size packet (or cell) switching devices. These devices, which are said to be protocol agnostic, are considered to be simpler and more easily tunable for performance than other solutions, especially those handling variable-length packets. Thus, N×N switch fabrics, which can be viewed as black boxes with N inputs and N outputs, have been made capable of moving short fixed-size packets (typically 64-byte packets) from any incoming link to any outgoing link. Hence, communications protocol packets and frames need to be segmented into fixed-size packets while being routed at a network node. Although short fixed-size packet switches are thus often preferred, the segmentation and subsequent necessary reassembly (SAR) they imply have a cost. Switch fabrics that handle variable-size packets are thus also available. They are designed so that they do not require, or at least limit, the amount of SAR needed to route higher protocol frames.
Whichever type of packet switch is considered, they all have in common the need for an efficient flow control mechanism which must attempt to prevent all forms of congestion. To this end, all modern packet switches use a scheme referred to as 'virtual output queuing' or VOQ. As sketched in
Organizing input queuing as a VOQ has the great advantage of preventing any form of 'head of line' or HoL blocking. HoL blocking is potentially encountered each time incoming traffic, on one input port, has a packet destined for a busy output port which cannot be admitted into the switch core because the flow control mechanism has determined it is better not to admit it, e.g., to prevent an output queue (OQ) such as (130) from overfilling. Hence, other packets waiting in line are also blocked since, even though they may be destined for an idle output port, they just cannot enter the switch core. To prevent this from ever occurring, IA input queuing is organized as a VOQ (115). Incoming traffic on each input port, i.e., in each IA, is sorted per destination port (125) and in general per class of service or flow-ID, so that if an output port is experiencing congestion, traffic for other ports, if any, can be selected instead and thus does not have to wait in line.
This important scheme, which authorizes input queuing in switch fabrics without its drawback, i.e., HoL blocking, was first introduced by Y. Tamir and G. Frazier, "High performance multi-queue buffers for VLSI communication switches," in Proc. 15th Annu. Symp. Comput. Arch., June 1988, pp. 343-354. It is universally used in all kinds of switch fabrics that rely on input queuing and is described, or simply assumed, in numerous publications dealing with this subject. As an example, a description of the use of VOQ and of its advantages can be found in "The iSLIP Scheduling Algorithm for Input-Queued Switches" by Nick McKeown, IEEE/ACM Transactions on Networking, Vol. 7, No. 2, April 1999.
The implementation of a packet switching function brings a difficult challenge, namely the overall control of all the flows of data entering and leaving it. Whichever method is adopted for flow control, it always assumes that packets can be temporarily held at various stages of the switching function so as to handle flows on a priority basis, thus supporting QoS (Quality of Service) and preventing the switch from becoming congested. The VOQ scheme fits well with this, allowing packets to be preferably held in input queues, i.e., in the IA's (100), before entering the switch core (110), while not introducing any blocking of higher priority flows.
As an example of this,
This scheme works well as long as the time to feed the information back to the source of traffic, i.e., the VOQ's of the IA's (100), is short when expressed in packet-times. However, packet-time reduces dramatically in the most recent implementations of switch fabrics, where the demand for performance is such that aggregate throughput must be expressed in tera (10^12) bits per second. As an example, packet-time can be as low as 8 ns (nanoseconds, i.e., 10^-9 sec.) for 64-byte packets received on an OC-768 or 40 Gbps (10^9 bps) switch port having a 1.6 speedup factor, thus actually operating at 64 Gbps. As a consequence, the round trip time (RTT) of the flow control information is far from negligible, as used to be the case with lower speed ports. As an example of a worst case traffic scenario, all input ports of a 64-port switch may have to forward packets to the same output port, eventually creating a hot spot. It will take RTT time to detect this and block the incoming traffic in all VOQ's involved. If RTT is, e.g., 16 packet-times then 64×16=1024 packets may have to accumulate for the same output in the switch core. A RTT of 16 packet-times corresponds to the case where, for practical considerations and mainly because of packaging constraints, distribution of power, reliability and maintainability of a large system, port adapters cannot be located in the same shelf and have to interface with the switch core ports through cables. Then, if cables (150) are 10 meters long, because light travels through the cables at 5 nanoseconds per meter, it takes 100 nanoseconds, or about 12 packet-times (of 8 ns), to go twice through the cables. Adding the internal processing time of the electronic boards, including the multi-Gbps serializer/deserializer (SERDES), this may easily add up to the 16 packet-times used in the above example.
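The arithmetic of this worst-case example can be summarized with the short illustrative calculation below; the figures (1.6 speedup, 10-meter cables, 5 ns/m propagation, 64 ports) are those of the example above and nothing more:

```python
# Illustrative recomputation of the worst-case figures used above.
PACKET_BYTES = 64
PORT_RATE_BPS = 40e9 * 1.6          # OC-768 port with a 1.6 speedup factor
packet_time_ns = PACKET_BYTES * 8 / PORT_RATE_BPS * 1e9   # 8 ns per packet

CABLE_M = 10
PROPAGATION_NS_PER_M = 5            # propagation delay through the cable
cable_rtt_ns = 2 * CABLE_M * PROPAGATION_NS_PER_M         # 100 ns both ways
cable_rtt_packets = cable_rtt_ns / packet_time_ns          # about 12 packet-times

RTT_PACKETS = 16                    # cables plus board/SERDES processing time
N_PORTS = 64
worst_case_accumulation = N_PORTS * RTT_PACKETS            # 1024 packets

print(packet_time_ns, cable_rtt_packets, worst_case_accumulation)
```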
Hence, when the performance of a large switching equipment approaches or crosses the 1 Tbps level, typically with 40 Gbps (OC-768) ports, the RTT expressed in packet-times becomes too high to continue using a standard backpressure flow control mechanism such as the ones briefly discussed in
The above, however, refers primarily to the case of unicast traffic, that is, when incoming packets need to be forwarded to only one destination or output port, e.g., (155). It is also important to be able to handle multicast traffic efficiently, i.e., traffic that arrives from an ingress port and must be dispatched to more than one output port, in any combination of 2 to N ports.
Multicast traffic is becoming increasingly important with the development of networking applications such as video distribution or video-conferencing. Multicast has traditionally been an issue in packet switches because of the intrinsic difficulty of handling all combinations of destinations without any restriction. As an example, with a 16-port fabric there are possibly 2^16 - 17 combinations of multicast flows, i.e., about 65 k flows. This number however reaches four billion combinations with a 32-port switch (2^32 - 33). Even though it is never the case that all combinations need to be, or can be, used simultaneously, there must ideally be no restriction in the way multicast flows are allowed to be assigned to output port combinations for a particular application. As illustrated on
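The combination counts quoted above (all subsets of two or more ports) can be checked with a one-line computation; this is only an illustration of the combinatorial growth:

```python
# Number of multicast flows = all subsets of N ports minus the empty set
# and minus the N unicast (single-port) subsets, i.e. 2**N - N - 1.
def multicast_combinations(n_ports: int) -> int:
    return 2 ** n_ports - n_ports - 1

print(multicast_combinations(16))   # 65519, about 65 k flows (2^16 - 17)
print(multicast_combinations(32))   # 4294967263, about four billion (2^32 - 33)
```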
Therefore, there is a need to be able to support MC traffic, from a single MC queue that is part of a VOQ-organized ingress adapter, to a switch core of a kind aimed at solving the problems raised by the backpressure type of switch core of the prior art, i.e., one implementing a collapsed virtual output queuing mechanism or cVOQ, and this without any design restriction on the way output ports can be freely assigned to the multicast flows.
Thus, it is a broad object of the invention to remedy the shortcomings of the prior art as described hereinabove.
It is another object of the invention to provide a system and a method to prevent any form of packet traffic congestion in a shared-memory switch core, adapted to handle multicast traffic.
It is a further object of the invention to permit that an absolute upper bound on the size of the shared memory, necessary to achieve this congestion-free mode of operation, be definable irrespective of any incoming traffic type.
It is still another object of the invention to further reduce the above necessary amount of shared memory in the switch core, while maintaining a congestion-free operation and without impacting performance, by controlling the filling of the shared memory and keeping data packets flowing up to the egress port adapter buffers.
The accomplishment of these and other related objects is achieved by a method for switching unicast or multicast data packets in a shared-memory switch core, from a plurality of ingress port adapters to a plurality of egress port adapters, each of said ingress port adapters including an ingress buffer comprising at least one virtual output queue per egress port to hold incoming unicast data packets and one virtual output queue to hold incoming multicast data packets, each of said ingress port adapters being adapted to send a transmission request when a data packet is received, to store said data packet, and to send a data packet referenced by a virtual output queue when an acknowledgment corresponding to said virtual output queue is received, said method comprising the step of,
Further advantages of the present invention will become apparent to those skilled in the art upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated herein.
Instead of relying on feedback from the switch core (210) to stop forwarding traffic in case of congestion, thus carrying out a backpressure mechanism as discussed in the background section with
Although, for the sake of clarity,
Therefore, sending a unicast or multicast request (207) to the switch core for each arriving unicast or multicast packet makes it possible to keep track of the state of all VOQ's within a switch fabric. This is done, e.g., in the form of an array of counters (260). Each individual counter (262) is the counterpart of an IA queue such as (225). On reception by the switch core of a unicast or multicast request, which carries the reference of the queue to which it belongs in the IA (200), the corresponding counter (262) is incremented so as to record how many packets are currently waiting to be switched. This process occurs simultaneously, at each packet cycle, for all IA's (200) that have received a packet (205). There is thus possibly up to one request per input port to be processed at every packet cycle. As a consequence of the above, the array of counters (260) collapses the information of all VOQ's, i.e., from all IA's, into a single place. Hence, the switch core gains a complete view of the incoming traffic to the switch fabric.
Collapsing all VOQ's in the switch core, that is, implementing a collapsed virtual output queuing array (cVOQ), makes it possible to return unicast or multicast acknowledgments (240) to all IA's that have at least one non-empty queue. On reception of a unicast or multicast acknowledgment, an IA may unconditionally forward the corresponding waiting unicast or multicast packet, e.g., to a shared memory (212) as in the prior art, from where it will exit the switch core through one (UC flows) or more (MC flows) output ports to an egress adapter (not shown). Issuing an acknowledgment from the switch core, which triggers the sending of a packet from an IA, also allows the corresponding counter to be decremented so as to keep the IA's VOQ and the collapsed switch core VOQ in sync. Hence, because all information is now available in a single place, the collapsed VOQ's in the switch core, referred to as cVOQ in the following description, a comprehensive choice can be exercised on what is best to return to the IA's at each packet cycle to prevent the switch core from becoming congested and the shared memory from overflowing, thus keeping in the ingress queues the packets that cannot yet be scheduled to be switched.
According to what has just been described, each MC packet (205), while being queued in the IA, triggers the sending of an MC request to the switch core. MC requests simply carry a flag allowing the switch core (210) to distinguish between a unicast request and a multicast one. Like UC requests, which are used to increment per-destination counters (265) related to one IA, MC requests are used to increment a specific multicast counter (270) per IA. Thus, within the switch core, there are as many MC counters (270) as there are ingress adapters (200). Similarly to the unicast counters, the MC counters collapse the MC queues of all IA's, allowing the switch core to get a global view of all pending requests (UC+MC).
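As a purely illustrative sketch of the bookkeeping just described, the cVOQ array can be modeled as one unicast counter per (ingress adapter, egress port, class of service) plus one multicast counter per ingress adapter; the names and structure below are assumptions of this sketch, not mandated by the invention:

```python
# Minimal sketch of the cVOQ counter array, assuming N ports and C classes
# of service: one UC counter per (IA, egress, CoS) plus one MC counter per IA.
class CollapsedVOQ:
    def __init__(self, n_ports: int, n_cos: int):
        self.n_ports, self.n_cos = n_ports, n_cos
        # uc[ia][egress][cos] mirrors the corresponding IA unicast queue.
        self.uc = [[[0] * n_cos for _ in range(n_ports)] for _ in range(n_ports)]
        # mc[ia] mirrors the single multicast queue of each IA.
        self.mc = [0] * n_ports

    def on_request(self, ia: int, multicast: bool, egress: int = 0, cos: int = 0):
        # A request carries a flag distinguishing UC from MC; a UC request
        # also references its destination queue (egress port and CoS).
        if multicast:
            self.mc[ia] += 1
        else:
            self.uc[ia][egress][cos] += 1

    def on_acknowledgment(self, ia: int, multicast: bool, egress: int = 0, cos: int = 0):
        # The acknowledgment that triggers the sending of a packet also
        # decrements the counter, keeping the IA VOQ's and the cVOQ in sync.
        if multicast:
            self.mc[ia] -= 1
        else:
            self.uc[ia][egress][cos] -= 1
```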
This mode of operation is to be compared with the flow control of the prior art (
It is worth noting here that the invention rests on the assumption that an unrestricted number of requests (in lieu of real packets) can be forwarded to the switch core, thus without necessitating any backpressure on the requests. This is indeed feasible since counters are used for requests instead of real memory to store packets. Doubling the range of a counter requires only one more bit to be added, whereas the size of the memory would have to be doubled if packets were admitted into the switch core, as is the case with a backpressure mode of operation. Hence, because the hardware resources needed to implement the new mode of operation grow only as the logarithm of the number of requests to handle, this is indeed feasible. From a practical point of view, each individual cVOQ counter should be large enough to count the total number of packets that can be admitted into an ingress adapter. This takes care of the worst case where all waiting packets belong to a single queue. Typically, ingress adapters are designed to each hold a few thousand packets. For example, 12-bit (4 k) counters may be needed in this case. There are other considerations that limit the size of the counters, such as the maximum number of packets to be admitted in the IA's for a same destination.
Immediately upon arrival of a packet, a request (307) is issued to the switch core (310). The request needs to travel through the cable (350) and/or the wiring on the electronic board(s) and backplane(s) used to implement the switching function. The adapter port to switch core link may also use one or more optical fibers, in which case there may also be opto-electronic components on the way. The request eventually reaches the switch core, so that the latter is informed that one more packet is waiting in the IA. In a preferred mode of implementation of the invention this results in the incrementing of a binary counter associated with the corresponding queue, i.e., the individual counter (362), part of the set of counters (360) that collapses all VOQ's of all ingress adapters as described in the previous figure.
Then, the invention assumes there is a mechanism in the switch core (365) which selects which of the pending requests should be acknowledged. No particular selection mechanism is assumed by the invention for determining which IA queues should be acknowledged first. This is highly dependent on the particular application and the expected behavior of the switching function. Whichever algorithm is used, however, only one acknowledgment per output port, such as (342), can possibly be sent back to its respective IA at each packet cycle. Thus, the algorithm should tend to always select one pending request per EA (if any is indeed pending for that output, i.e., if at least one counter is different from zero) in the cVOQ array (360) in order not to waste the bandwidth available for the sending of acknowledgments to the IA's. When several adapters have waiting packets for the same output, i.e., there are several non-zero counters in the column corresponding to one egress port (355), it is always possible to exercise, in a column, the best choice, e.g., to select the adapter which has the highest priority packet waiting to be switched. This must be compared to the backpressure mode of operation of the prior art, described in
Acknowledgments, such as (342), are thus for a given output port in the case of a unicast packet, or for any output port in the case of a multicast packet. More generally, they are defined on a per flow basis as discussed earlier. As a consequence, an IA receiving such an acknowledgment unambiguously knows which one of the packets waiting in the buffer (315) should be forwarded to the switch core. It is the one situated at the head of line of the queue referenced by the acknowledgment, whatever the type of traffic, unicast or multicast. The corresponding packet is thus retrieved from the buffer and immediately forwarded (322). Because the switch core request selection process has a full view of all pending requests and also knows what resources remain available in the switch core, no acknowledgment is sent back to an IA if the corresponding resources are exhausted. With a shared memory such as (312), this translates into the fact that there must be enough room left before authorizing the forwarding of a corresponding acknowledgment. Also, in such a mode of operation, there is no need to bring into the switch core too many packets for a same output port. There must just be enough packets for every output port so that the switch is said to be work-conserving. In other words a maximum of RTT packets, per output, should be brought into the shared memory if the corresponding input traffic indeed permits it. This is sufficient to guarantee that packets can continuously flow out of any port so that no cycle is ever wasted (while one or more packets would otherwise be unnecessarily waiting to be processed in an ingress adapter). Having RTT packets to be processed by each core output port leaves enough time to send back an acknowledgment and receive a new packet on time. If, as in the example of the background section, RTT is 16 packet-times and the switch core has 64 ports, the shared memory (312) needs to be able to hold a maximum of 64×16=1024 packets. Indeed, if no adapter is located more than 16 packet-times away from the switch core, the shared memory cannot overflow and a continuous flow of packets can always be sustained to a port receiving 100% of the aggregate traffic from a single input port or from any mix of 1 to 64 ports in this example.
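Although the invention mandates no particular selection algorithm, the per-cycle decision just described can be sketched as follows; the trivial 'first non-empty counter' policy, the omission of classes of service, and all names are assumptions of this sketch, not the actual selection mechanism:

```python
# Sketch of a per-packet-cycle acknowledgment selection: at most one
# acknowledgment per egress port, none unless shared-memory room has been
# reserved, and no more than RTT packets in flight toward a same output.
def select_acknowledgments(cvoq_uc, sm_free_buffers, rtt, per_output_in_sm):
    acks = []                                 # list of (ia, egress) to acknowledge
    n_ports = len(cvoq_uc)
    for egress in range(n_ports):
        if sm_free_buffers <= 0:
            break                             # no shared-memory space left to reserve
        if per_output_in_sm[egress] >= rtt:
            continue                          # RTT packets already queued for this output
        for ia in range(n_ports):
            if cvoq_uc[ia][egress] > 0:       # at least one packet waiting in that IA
                cvoq_uc[ia][egress] -= 1      # request honored, counter decremented
                sm_free_buffers -= 1          # reserve one buffer ahead of the packet
                per_output_in_sm[egress] += 1
                acks.append((ia, egress))
                break                         # only one acknowledgment per output port
    return acks, sm_free_buffers
```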
A consequence of the mode of operation according to the invention is that it always takes two RTTs to switch a packet (i.e., 2×16×8=256 ns with 8-ns packets), because a request is first sent and the actual packet is only forwarded upon reception of an acknowledgment. In return, this makes it possible to control exactly the resources needed to implement a switch core, irrespective of any traffic scenario. As shown in the above example, the size of the shared memory is bounded by the back and forth travel time (RTT) between adapters and switch core and by the number of ports.
No packet is ever admitted into the switch core unless it is guaranteed to be processed within RTT time.
Although many alternate ways are possible, including having dedicated links and I/O's to this end, a preferred mode of implementation is to have the requests and acknowledgments carried in the headers of the packets that are continuously exchanged between adapters and switch core (i.e., in-band). Indeed, in a switch fabric of the kind considered by the invention, numerous high speed (multi-Gbps) links must be used to implement the port interfaces. Even when there is no traffic through a port at a given instant, idle packets are exchanged to keep the links in sync and running. Whether packets are 'true' packets, i.e., carrying user data, or idle packets, they are comprised of a header field (400) and a payload field (410), the latter being significant, as data, in user packets only. There is also, optionally, a trailing field (420) to check the packet after switching. This takes the form of a FCS (Field Check Sequence), generally implementing some sort of CRC (Cyclic Redundancy Checking) to check the packet content. Obviously, idle packets are discarded in the destination device after the header information they carry has been extracted.
Hence, there is a continuous flow of packets in both directions, idle or user packets, on all ports between adapters and switch core. Their headers can thus carry the requests and acknowledgments in a header sub-field, e.g., (430). Packets entering the switch core thus carry the requests from the IA's, while those leaving the switch core carry the acknowledgments back to the IA's.
The packet header contains all the information necessary for the destination device (switch core or egress adapter, discussed in the next figure) to process the current packet. Typically, this includes the destination port and the priority or CoS associated with the current packet, and generally much more, e.g., whether the packet is a unicast or a multicast packet.
Unlike the rest of the header, the Request/Acknowledgment sub-field (430) is foreign to the current packet and refers to a packet waiting in an ingress adapter. Therefore, the Request/Acknowledgment sub-field must unambiguously reference the queue concerned by the request or acknowledgment, such as (320) in
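For illustration only, a minimal sketch of a packet carrying such an in-band Request/Acknowledgment sub-field is given below; the field names, types and layout are assumptions of the sketch, not the actual header format:

```python
# Hypothetical in-band packet layout: header, payload, optional check field.
# The Request/Acknowledgment sub-field refers to a *waiting* packet and is
# therefore unrelated to the destination of the carrying packet.
from dataclasses import dataclass

@dataclass
class PacketHeader:
    destination_port: int      # routing of the current packet (ignored if idle)
    cos: int                   # class of service of the current packet
    is_multicast: bool         # current packet is unicast or multicast
    is_idle: bool              # idle packets keep the link in sync, carry no data
    # In-band flow-control sub-field (430): which IA queue the request or
    # acknowledgment refers to, plus a flag distinguishing UC from MC.
    reqack_valid: bool = False
    reqack_is_ack: bool = False
    reqack_is_multicast: bool = False
    reqack_queue: int = 0      # e.g. egress port (and CoS) of the referenced queue

@dataclass
class Packet:
    header: PacketHeader
    payload: bytes = b""       # significant only in user packets
    fcs: int = 0               # optional trailing check field
```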
However, regarding MC requests, it must be highlighted that they do not carry any information related to the destinations of the MC packets for which they have been issued, other than the fact that their corresponding packet is destined for multiple egress ports. In the same way as unicast and multicast requests are differentiated with a simple flag, acknowledgments such as (240) in
As an example, a packet destined for port N may carry a request for a packet destined for port #2. Thus, the carrying packet can be any user packet or just an idle packet that will be discarded by the destination device after the information it carries has been extracted.
It is worth noting here that idle packets can optionally carry information not only in their headers but also in the payload field (410), since they do not actually transport any user data.
The invention assumes there is an egress buffer (575) in each egress adapter to temporarily hold (574) the packets to be transmitted. The egress buffer is a limited resource and its occupation must be controlled. The invention assumes that this is achieved by circulating tokens (580) between each egress adapter (570) and the corresponding switch core port. There is one token for each packet buffer space in the egress adapter. Hence, a token is released to the switch core (581) each time a packet leaves an egress adapter (572), while one is consumed by the switch core (583) each time it forwards a packet (555). In practice, the tokens to the egress buffer (583) take the form of a counter in each egress port of the switch core (563). The counter is decremented each time a packet is forwarded. Thus, in this direction, the packet is also implicitly the token and need not be otherwise materialized.
When a packet is released from the egress buffer, though, the corresponding token counter, (UTC) such as (563) for unicast packets or (MTC) such as (565) for multicast packets, must be incremented since one buffer has been freed. In this case tokens such as (581) are materialized by updating a sub-field in the header of any packet entering the switch through ingress port #2. As with the Request/Acknowledgment sub-field shown in
Therefore, the switch core is always informed, at any given instant and for each egress port, of how many packet buffers are known to be unoccupied in the egress adapter buffers. Thus, at each packet cycle, it is possible to make a decision to forward, or not, a packet from the switch core to an egress adapter on the basis of the TC values. Clearly, if a token counter is greater than zero a packet can be forwarded since there is at least one buffer space left in that egress buffer (575).
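A minimal sketch of this per-egress-port token bookkeeping follows; the class and method names, and the choice to initialize both counters to RTT, are assumptions of the sketch:

```python
# Sketch of the per-egress-port token counters held in the switch core.
# One token equals one free packet buffer space in the egress adapter.
class EgressTokens:
    def __init__(self, rtt_packets: int):
        # Both counters are assumed to start at the egress buffer capacity.
        self.utc = rtt_packets    # unicast token counter (563)
        self.mtc = rtt_packets    # multicast token counter (565)

    def can_forward(self, multicast: bool) -> bool:
        # A packet may leave the switch core only if at least one buffer
        # space is known to be free in the egress adapter.
        return (self.mtc if multicast else self.utc) > 0

    def on_packet_forwarded(self, multicast: bool):
        # The forwarded packet implicitly consumes one token.
        if multicast:
            self.mtc -= 1
        else:
            self.utc -= 1

    def on_token_returned(self, multicast: bool):
        # Carried back in a header sub-field when a packet leaves the
        # egress buffer, freeing one buffer space.
        if multicast:
            self.mtc += 1
        else:
            self.utc += 1
```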
However, in a preferred embodiment of the invention, requests for multicast traffic are assumed to carry only a multicast flag, which alone does not allow determining the particular combination of destinations for which the corresponding packet is destined (as described in
As already observed with the requests, acknowledgments and packets, up to RTT tokens can be in flight, mainly because of the propagation delay over cables and wiring and because of the internal processing time of the electronic boards. Hence, the egress buffer must be able to hold RTT packets, so that the switch core can forward RTT packets, thus consuming all its tokens for a destination, before seeing the first token (581) returned just in time to keep sending packets to that destination, if there is indeed a continuous flow of packets to be forwarded.
Simultaneously, while the bm (bitmap) vector is obtained from the LUT, the incoming MC packet (622) is temporarily stored in the shared memory (625). Depending on the particular implementation, this may be for as little as one packet cycle, especially if all MC tokens are available to replicate and forward the incoming packet to the egress adapters (655) and if no other packets, UC or MC, are waiting to be processed.
Many cases can be encountered depending on the combinations of bm vectors resulting from the LUT interrogations. The simplest case is when the EB's targeted by all bm vectors (obtained from the LUT read-outs addressed by the RI fields of possibly several MC packets arriving simultaneously from different IA's) do not overlap, and moreover do not overlap either with the destinations targeted by unicast packets which may arrive in the same packet cycle as a result of unicast acknowledgments sent back by the Request Selection mechanism (365) together with the possible multiple MC acknowledgments. In this case there is no contention at all between packets (unicast or multicast copies) for a same output. However, the general case is when unicast packets and multicast packets, possibly from different sources, contend for a same destination. Nothing is assumed by the invention about the request selection mechanism, which may send multiple MC acknowledgments to different IA's as long as there are enough MC tokens available and enough buffer space in the shared memory. The worst case is thus when one or more unicast packets, plus as many multicast packets (received from different IA's) as there were MC acknowledgments returned simultaneously to the IA's, contend for the same switch core egress port. Knowing that only one packet can be sent per packet cycle, contending packets need to be temporarily stored in the shared memory (625) for as many cycles as there are packets received for a same egress port in a same cycle.
It is not a purpose of the invention, however, to choose which packet should be sent out first. Criteria such as packet type (unicast or multicast) or packet priority (high or low) may be used to determine the first packet to be sent to the egress adapter. If the unicast packet is sent first, then the contending MC packets need to be queued until after the next UC packet departure time. In a preferred embodiment of the invention, this is performed by queuing pointers (690) referencing the shared memory locations of the single copy of those MC packets.
At this point, it should be recalled that one major advantage of a shared memory is to natively support multicast. It can indeed deliver as many copies as required of a same packet, which needs to be kept in memory until the last copy has been withdrawn, at which time the corresponding buffer space may be released. If the unicast packet is not sent first, then it will have to be queued (690), in a way similar to what is done when several unicast packets are received in the switch core for a same egress port, since it is assumed by the invention that the Request Selection mechanism actually has the freedom to do so.
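The 'single copy, multiple readers' behavior of the shared memory just described can be sketched with a reference-counted buffer and per-output pointer queues; the names and data structures below are assumptions of the sketch:

```python
# Sketch of native multicast support in a shared memory: one stored copy,
# one pointer queued per targeted output, buffer released after last copy.
from collections import deque

class SharedMemory:
    def __init__(self):
        self.buffers = {}          # buffer_id -> [packet, remaining copy count]
        self.output_queues = {}    # egress port -> deque of buffer_id's (690)
        self.next_id = 0

    def store(self, packet, destinations):
        buf = self.next_id
        self.next_id += 1
        self.buffers[buf] = [packet, len(destinations)]
        for port in destinations:                       # one pointer per copy
            self.output_queues.setdefault(port, deque()).append(buf)
        return buf

    def send_next(self, port):
        # Called once per packet cycle per egress port (one departure per cycle).
        queue = self.output_queues.get(port)
        if not queue:
            return None
        buf = queue.popleft()
        packet, remaining = self.buffers[buf]
        if remaining == 1:
            del self.buffers[buf]                       # last copy withdrawn: free buffer
        else:
            self.buffers[buf][1] = remaining - 1
        return packet
```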
It should also be noted that the MC token counters MTC (565) are decremented by one for each MC packet (650) leaving the switch core towards the egress adapter. This indicates that there is one less free position in the egress adapter. They are incremented when a multicast token (665) is returned from the adapter after a multicast packet has left the adapter egress buffer, thus allowing the switch core to know that one multicast packet location (670) is available again in the adapter egress buffer.
It should be mentioned that the differentiation between unicast and multicast tokens, and so the distinction between UTC and MTC counters, usually does not make sense when the egress part of the adapter has a single physical or logical output. However, there are cases where the egress adapter external interface (not shown) is made of several physical or logical outputs. An example can be an adapter connected to a single switch core port providing the equivalent of a 10-Gbps throughput, while actually supporting several external attachments, e.g., 4 OC-48 attachments, each one with a 2.5 Gbps throughput. In such a case, an incoming packet may have to be further multicast through several distinct external attachments. Thus, the token (665) corresponding to the buffer occupancy of such a multicast packet is only returned to the switch core when all copies have been forwarded and the memory space released in the egress buffer (670).
A packet received in an ingress adapter (700) is unconditionally stored (705) in the ingress buffer (710) (an upstream flow control to the actual source of traffic is assumed, so that the ingress buffer is not overfilled). The receiving of a packet immediately triggers the sending of a request (715). The request travels up to the switch core (720), where the filling of the shared memory (725) and the availability of a token (730) for the egress port to which the received packet is destined are checked. If both conditions are met, i.e., if there is enough space available in the shared memory and if there is one or more tokens left, then an acknowledgment (735) may be issued back to the IA (740).
Upon reception of the acknowledgment, the IA unconditionally forwards a packet (745) corresponding to the just received acknowledgment. It is important here to notice that there is no strict relationship between a given request, an acknowledgment and the packet which is forwarded. As explained earlier, incoming packets are always queued per destination (VOQ), and also generally per CoS or flow, for which the acknowledgment (735) is issued; thus, it is always the head-of-queue packet that is forwarded from the IA, so that no disordering in the delivery of packets can be introduced.
When the forwarded packet is received in the switch core it is queued to the corresponding egress port and sent to the egress buffer (780), in arrival order, consuming one token (760). If no token is available, packet forwarding is momentarily stopped. So is the sending of acknowledgments back to the IA's having traffic for that destination. Already received packets wait in the SM but no more can be admitted until tokens are returned (775) from the egress adapter as discussed in
Once in the egress buffer (780), the packet is queued for the output and leaves (770) the egress adapter in arrival order or according to any policy enforced by the adapter. Generally, the adapter is designed to forward high priority packets first. While the invention does not assume any particular algorithm or mechanism to select an outgoing packet from the set of packets possibly waiting in the egress buffer, it definitely assumes that a token is released (775) to the corresponding switch core egress port as soon as a packet leaves the egress adapter (770), so that the token eventually becomes available to allow the moving of more packets, first from the IA to the switch core, then from the switch core to the egress buffer.
At this point, it must be clear that in a switch core according to the invention, the shared memory need not actually be as large as the upper bound calculated in
Interestingly enough, this is what the well-known iSLIP algorithm (see the earlier reference to iSLIP in the background section), devised for switch cores that use a crossbar, must accomplish. Hence, one possible request selection algorithm is iSLIP, which makes it possible to drastically limit the size of the shared memory in a switch fabric according to the invention.
The use of a shared memory, however, allows a more efficient algorithm to be used that tolerates the reception, at each cycle, of several packets for the same egress port (thus, from several ingress adapters) and that can be much more easily carried out at the speeds required by the modern terabit-class switch fabrics considered by the invention. Any number between one and RTT packets, the maximum necessary as discussed in
As an example, if the selection algorithm retained is able to limit to a maximum of four the number of packets selected at each cycle for a same destination, then the shared memory for a 64-port switch needs to hold only 64×4=256 packets. The egress buffer must stick to the RTT rule though. That is, each egress adapter must have a 16-packet buffer, possibly per priority, if a RTT of 16 packet-times is to be supported. The ingress buffering size depends only on the flow control between the ingress adapter and its actual source(s) of traffic.
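The sizing trade-off of this example reduces to a small, purely illustrative calculation:

```python
# Shared-memory and egress-buffer sizing for the example above.
N_PORTS = 64
RTT_PACKETS = 16                  # worst-case round trip in packet-times
MAX_SELECTED_PER_OUTPUT = 4       # what the selection algorithm is assumed to enforce

upper_bound_sm = N_PORTS * RTT_PACKETS              # 1024 packets, no selection limit
reduced_sm = N_PORTS * MAX_SELECTED_PER_OUTPUT      # 256 packets with the limit
egress_buffer = RTT_PACKETS                         # per egress adapter (per priority)

print(upper_bound_sm, reduced_sm, egress_buffer)
```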
Whenever a unicast data packet is received through an input port of the switch fabric (800), its header is examined. While the packet is stored in the ingress buffer, an entry is made at the tail of the queue it belongs to so that it can later be retrieved. Then, a unicast request, corresponding to the queue it has been appended to, is issued to the switch core (810), which records it in an array of pending requests (cVOQ), an image of all the queues of all IA's connected to the switch core, as described in
When the queue to which the request belongs is actually selected (835), a unicast acknowledgment (840) is returned to the corresponding IA and the request is immediately canceled since it has been honored. Simultaneously, a shared memory buffer space is reserved by removing one buffer from the count of available SM buffers (even though the corresponding packet has not been received yet). In a preferred mode of implementation of the invention, cancellation of the honored request simply consists in decrementing the relevant individual unicast counter in the cVOQ array of request counters.
When the acknowledgment reaches the IA, the packet is immediately retrieved and forwarded to the switch core (850), where it can be unrestrictedly stored since space was reserved when the acknowledgment was issued to the IA. Then, if an egress unicast token is available, which is normally always the case, the packet may be forwarded right away to the egress adapter (870) and the SM buffer released.
When the packet exits the egress adapter, the corresponding buffer space becomes free and one egress UC token is returned to the switch core (880).
Turning now to
Whenever a multicast data packet is received through an input port of the switch fabric (900), its header is examined. While the packet is stored in the ingress buffer, an entry is made at the tail of the multicast queue it belongs to so that it can later be retrieved. Then, a multicast request is issued to the switch core (910), which records it in an array of pending requests (cVOQ), an image of all the queues of all IA's connected to the switch core, as described in
When the multicast queue to which the request belongs is actually selected (935), a multicast acknowledgment (940) is returned to the corresponding IA and the corresponding multicast request is immediately canceled since it has been honored. Simultaneously, a shared memory buffer space is reserved by removing one buffer from the count of available SM buffers (even though the corresponding packet has not been received yet). In a preferred mode of implementation of the invention, cancellation of the honored request simply consists in decrementing the relevant individual multicast counter in the cVOQ array of request counters.
When the acknowledgment reaches the IA, the packet is immediately retrieved and forwarded to the switch core (950), where it can be unrestrictedly stored since space was reserved when the acknowledgment was issued to the IA, as explained above. Then, if egress multicast tokens are available for all destinations of the multicast packet as indicated by the bitmap obtained through the RI look-up, which is normally always the case, copies of the multicast packet may be forwarded right away to the related egress adapters (970) and the SM buffer released. In the case where egress multicast tokens are not immediately available for all destinations, copies of the multicast packet may be sent only to those ports for which egress multicast tokens are available, while the remaining copies will be sent only when egress multicast tokens become available again, indicating available space in the egress adapter. Only when the last copy has been delivered can the SM buffer be released.
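The partial-forwarding behavior just described can be sketched as follows, assuming the destination bitmap comes from the RI look-up and one MTC exists per egress port; the function and parameter names are assumptions of the sketch:

```python
# Sketch of multicast replication driven by the RI-derived destination
# bitmap: copies are sent only to ports whose MTC is non-zero, the rest
# wait; the SM buffer is released only once the last copy has gone out.
def replicate_multicast(pending_destinations, mtc, forward_copy):
    """pending_destinations: set of egress ports still owed a copy.
       mtc: dict egress port -> multicast token counter.
       forward_copy: callback sending one copy to one egress port.
       Returns True when the SM buffer can be released."""
    for port in sorted(pending_destinations):
        if mtc[port] > 0:                 # buffer space known free in egress adapter
            mtc[port] -= 1                # consume one multicast token
            forward_copy(port)
            pending_destinations.discard(port)
    return not pending_destinations       # all copies delivered: release the buffer
```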
When the packet exits the egress adapter, the corresponding buffer space becomes free and one egress MC token is returned to the switch core (980).
A lack of unicast or multicast tokens could result from a malfunction or congestion of the egress adapter. Especially, a downstream device to which the egress adapter is attached may not be ready, preventing the egress adapter from sending traffic normally (flow control). Another reason for a lack of tokens would be that the actual RTT is larger than what was accounted for in the design of the switch fabric; hence, tokens (and requests and acknowledgments) may need more time to circulate through the cables and wiring of a particular implementation of one or more ports. In this case the switch fabric is underutilized by those ports, since wait states are introduced due to a lack of tokens and because acknowledgments do not return on time.
It must also be pointed out that, because the request selection algorithm of the switch core may authorize several acknowledgments for a same egress port to be sent back to the IA's, or because of the reception of MC acknowledgments, several packets may be received for a same egress port in a same packet cycle. Obviously, packets stored in the SM must wait in line until they can be forwarded to the egress adapter, in subsequent cycles, consuming one egress token each time. Hence, as long as the request selection algorithm can manage to send back to the IA's one acknowledgment per egress port at each packet cycle, the switch core never later receives more than one packet per destination per cycle in the SM. If tokens are normally available, packets are immediately forwarded to the egress adapter and stay in the SM for one packet cycle only.
Once in the egress adapter, the packets to forward are selected according to any algorithm appropriate to the application. Egress tokens are returned to the switch core when packets leave the egress buffer (880, 980).
Instead of being comprised, as shown in
If, however, an MC request is forwarded from an IA for a packet that needs to be replicated through a blocked port, the corresponding MC acknowledgment is NOT going to be returned and, because there is a single MC queue in each IA, the whole MC traffic of that IA is going to be stopped anyway. However, this form of HoL blocking can easily be avoided if the switch core is indeed informed of the port blocking. Knowing the port is malfunctioning, or after a time-out, it can decide either to ignore this port in the returning of the acknowledgments and later in the replication of the packet from the shared memory, or to send a discard command to the corresponding IA so that the packet that cannot be normally multicast is dropped from the IA buffer and the cVOQ of the switch core is updated accordingly, thereby removing the HoL blocking.
One will also notice that, in this mode of operation where RI's are sent with the MC requests, the RI may not need to be sent again in the MC packet header, since the information can be saved, e.g., in a background FIFO (1074), until the packet is received and queued to the output ports. This is done when the MC request is selected, the companion FIFO (1072) read out and the counter (1070) decremented, upon sending an acknowledgment back to the corresponding IA. Because packets are always delivered in FIFO order, their headers then need only contain an MC flag, so that the RI is retrieved from the background FIFO when the packet is received in the switch core. In practice the companion and background FIFO's can be implemented in the form of a single FIFO (1080) with two read pointers: one for the request to be acknowledged (1084) and one for retrieving the RI of the currently received packet (1082). There is also a write pointer (1086) to enter a new RI with each arriving MC request. Those skilled in the art of logic design know how to implement such a FIFO from a fixed read/write memory space (1088), e.g., from an embedded RAM (random access memory) in an ASIC (application specific integrated circuit).
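A behavioral sketch of such a single FIFO with one write pointer and two read pointers is given below, e.g., as it could be modeled before being mapped onto an embedded RAM; the fixed depth and the names are assumptions of the sketch:

```python
# Sketch of the single RI FIFO (1080) with one write pointer (1086) and two
# read pointers: one consumed when an MC request is acknowledged (1084) and
# one consumed when the corresponding MC packet is later received (1082).
class DualReadFifo:
    def __init__(self, depth: int):
        self.mem = [None] * depth          # fixed read/write memory space (1088)
        self.depth = depth
        self.wr = 0                        # write pointer: one new RI per MC request
        self.rd_ack = 0                    # read pointer used at acknowledgment time
        self.rd_pkt = 0                    # read pointer used at packet arrival time

    def push(self, routing_index):
        self.mem[self.wr % self.depth] = routing_index
        self.wr += 1

    def read_for_acknowledgment(self):
        ri = self.mem[self.rd_ack % self.depth]
        self.rd_ack += 1
        return ri

    def read_for_packet(self):
        # Packets are delivered in FIFO order, so this pointer never
        # overtakes rd_ack and the RI always matches the received packet.
        ri = self.mem[self.rd_pkt % self.depth]
        self.rd_pkt += 1
        return ri
```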
This alternate mode of operation is obviously obtained at the expense of a more complicated switch core, but can be justified for applications of the invention where multicasting is predominant, such as video distribution and video-conferencing.
In order to limit the hardware necessary to implement the switch core function of the invention one may want to reduce the size of the counters (and associated FIFO's if any) to what is strictly necessary since many of them have to be used. Typically, a cVOQ array of a 64-port switch, with 8 classes of service, supporting multicast traffic, must implement 64×(64×8+1)=32832 individual counters. Saving on counter size is thus multiplied by this number.
Contrary to the assumption used up to this point of the description, namely that an unlimited number of requests (i.e., up to the size of the ingress buffers) can be forwarded to the switch core, the size of the cVOQ counters can be made lower than what the largest ingress buffer can actually hold. Indeed, they can be limited to count RTT packets, provided there is appropriate logic, in each ingress adapter, to prevent the forwarding of too many requests to the switch core. In other words, IA's can be adapted so as to forward only up to RTT requests while seeing no acknowledgment coming back from the switch core. Obviously, the requests in excess of RTT must be queued in each IA and delivered later, when the count of packets in the corresponding cVOQ counter is known to be below RTT and can thus be incremented again. This is the case whenever an acknowledgment is returned from the switch core for a given queue. Hence, to limit the hardware required by the switch core, a logic mechanism must be added in each IA to retain the requests in excess of RTT. This complication in the mode of operation of a switch fabric according to the invention may be justified for practical considerations, e.g., in order to limit the overall quantity of logic needed to implement a core and/or to reduce the power dissipation of the ASIC (application specific integrated circuit) generally used to implement this kind of function.
As a consequence, each individual counter of the cVOQ array can be, e.g., a 4-bit counter if RTT is lower than 16 packet-times (so that the counter can count in a 0-15 range). Likewise, the size of the companion and background FIFO's can be reduced to RTT entries instead of, typically, several thousand. Hence, each IA must have the necessary logic to retain the requests in excess of RTT on a per queue basis (thus, per individual counter). There is, e.g., an up/down counter to count the difference between the number of received packets and the number of returned acknowledgments. If the difference stays below RTT, requests can be immediately forwarded as in the preceding description. However, if the level is above RTT, the sending of one request is contingent on the return of one acknowledgment, which is the guarantee that the counter can be incremented.
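The per-queue throttling logic just described can be sketched as an up/down counter gating the forwarding of requests; the class and method names are assumptions of the sketch:

```python
# Sketch of the ingress-adapter logic limiting in-flight requests so that
# the corresponding cVOQ counter never has to count beyond RTT.
class RequestThrottle:
    def __init__(self, rtt_packets: int):
        self.rtt = rtt_packets
        self.in_flight = 0      # packets requested but not yet acknowledged
        self.held_back = 0      # requests retained in the IA, in excess of RTT

    def on_packet_received(self):
        # Forward the request at once only if the cVOQ counter is guaranteed
        # to stay below RTT; otherwise retain it locally.
        if self.in_flight < self.rtt:
            self.in_flight += 1
            return True          # request forwarded to switch core now
        self.held_back += 1
        return False             # request queued in the IA

    def on_acknowledgment_received(self):
        # One acknowledgment frees one unit of counting capability; a
        # retained request, if any, can then be forwarded.
        self.in_flight -= 1
        if self.held_back > 0:
            self.held_back -= 1
            self.in_flight += 1
            return True          # forward one retained request now
        return False
```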
Therefore, in such an implementation of the invention, the counting capability is shared between the individual counters of the switch core and the corresponding counters of the IA's.
Foreign Application Priority Data: Application No. 03368075.2, Jul 2003, EP (regional).
The following patent applications are related to the subject matter of the present application and are assigned to common assignee:
1. U.S. patent application Ser. No. ______ (docket FR920030044US1), Alain Blanc et al., "System and Method for Collapsing VOQ's of a Packet Switch Fabric", filed concurrently herewith for the same inventive entity;
2. U.S. patent application Ser. No. ______ (docket FR920030045US1), Alain Blanc et al., "Algorithm and System for Selecting Acknowledgments from an Array of Collapsed VOQ's", filed concurrently herewith for the same inventive entity.
The above applications are incorporated herein by reference.