The present application claims priority from Japanese application JP 2006-350847 filed on Dec. 27, 2006, the content of which is hereby incorporated by reference into this application.
The invention relates to a switching technology for dynamically and mutually connecting plural functional blocks, existing in routers, servers, storage units and so forth, with each other, and in particular, to a technology for implementing prioritized bandwidth management, on the basis of priority information added to data, by utilizing plural switches that operate independently.
In a network transfer unit such as a router, a server, or a storage unit for connecting plural disk arrays with each other, a switch fabric is utilized for executing data switching between functional blocks within the unit. Since there are limitations to the switching bandwidth of the switch fabric, it is desirable to execute data switching according to priority when plural input data units converge on the same destination. That is, it is desired that high priority data be switched with low delay or high throughput.
In a network transfer unit such as a router or a switch, when data called a packet or a frame is received from a network, the priority of the data within the unit is decided by making use of header information of the data, and information on the priority within the unit is added to the data. For example, voice data, video data, data passing through a specific path, and so forth are given high priority, while other data is given low priority. Priority management is then achieved by changing how relevant data is handled in the switch fabric within the unit by making use of the added priority information.
Methods for priority management in the switch fabric can be broadly classified into the following two. The first is a method whereby the transmitting source node is provided with a function for prioritized bandwidth management. With this method, if priority is low, output is inhibited unless a certain threshold condition is met, even in a status where data could otherwise be transmitted to the switching device. That is, lower delay or higher throughput for high priority data is achieved by imposing per-priority output restrictions on data input to the switch fabric. JP-A No. 2002-247080 is cited as a specific example of this technology.
The second is a method whereby the switching device in the switch fabric is provided with a function for selectively outputting priority data. With this method, arbitration of data output is executed in the switching device on a destination-by-destination basis, in units of a variable-length packet or of a fixed-length cell, the cell being a constituent of a packet. Lower delay or higher throughput for high priority data is achieved by outputting higher priority data on a preferential basis.
Those conventional methods, however, each have a problem. With the first method, because usable bandwidths are always limited according to priority, output of low priority data is restricted even if the switch fabric is in a non-congestion state, for example when only low priority data exists and the switch fabric is unoccupied. As a result, there arises a problem that the switching bandwidth of the switch fabric cannot be fully utilized.
Further, US 20060104298 (A1) describes a method for executing bandwidth management by causing the switch, when the switch fabric is congested, to transmit information on where congestion has occurred, in the form of a command, to the transmitting side node. In this case, the switch needs a special mechanism for generating the command. Furthermore, as it takes some time for the command to reach the transmitting side node, this method lacks quick responsiveness.
With the second method, a portion of the low priority data transmitted from the transmitting side node to the switching device is liable to be retained in the switching device, and this retention causes problems. For example, in the case where the switching device does not have independent primary data-holding regions per priority, preceding low priority data blocks succeeding high priority data from the same transmitting source. In order to avoid this, the switching device needs independent primary data-holding regions per priority on a transmitting source-by-transmitting source basis, resulting in an increase in hardware size in proportion to the number of priorities, and hence an increase in hardware cost, so that a problem still remains.
Further, with the second method, the problem is pronounced in the case of variable-length data being switched with a dispersion type switch, wherein a switch with a switching throughput equivalent to 1/K of the target switching throughput is prepared on each of K planes, all transmitting source nodes and all destination nodes are connected to each of the K switch planes, and input data units are dispersed over the switch planes so as to be processed in parallel. In order to simplify the hardware configuration, the variable-length data is generally divided into plural fixed-length data units in the switch fabric before transmission, and the plural data units are reassembled into the original variable-length data at the destination.
At this point, if the switching device has the function for selectively outputting priority data, then when collision between high priority data and low priority data occurs on some of the K switch planes, the low priority data is left behind in those switching devices, while on the switches where no collision has occurred, the low priority data passes through as it is. Thus only a portion of the variable-length data is retained in the respective switches. If this state continues, the transmitting side node keeps transmitting data one after another by making use of unoccupied switch planes, so that succeeding low priority data overtakes preceding low priority data. This problem has a large influence particularly when the number of switch planes or the number of nodes is large. In order for the original variable-length data to be reproduced at the destination node, all fixed-length data units making up the original data must be queued; in a state where retention of data frequently occurs in the respective switches, as described above, the logic and memory for queuing the retained data inevitably become enormous, giving rise to a problem in terms of cost.
A problem to be resolved is to allow high priority data to pass with low delay or high throughput in a congestion state where a specific destination in the switch fabric is congested. At the same time, another problem to be resolved is to make full use of the switching bandwidth regardless of priority in a non-congestion state where the specific destination in the switch fabric is not congested.
In accordance with one aspect of the invention, a switch fabric includes plural transmitting source nodes each having two or more output queues, differentiated by priority, on a destination-by-destination basis; a switch for evenly distributing data units delivered from the plural transmitting source nodes on a destination-by-destination basis; and plural destination nodes for receiving the data units from the switch. Each transmitting source node assumes that a relevant destination is in a congestion state when the available capacity of the receive-buffer of the switch for that destination, as controlled by the transmitting source node, falls short of a set congestion threshold, and thereupon restricts data output from the per-priority output queues to the relevant destination to a preset bandwidth according to priority. Conversely, each transmitting source node assumes that the congestion state of the relevant destination is dissolved when the available capacity of the receive-buffer of the switch for that destination exceeds the set congestion threshold, and thereupon dissolves the restriction on the bandwidth according to priority.
When the invention is put to use, high priority data can pass with low delay or high throughput in the congestion state where a specific destination in the switch fabric is congested. At the same time, it is possible to make full use of the switching bandwidth regardless of priority in the non-congestion state where the specific destination is not congested. Furthermore, the prioritized bandwidth management can be provided with hardware resources as small in scale as possible.
Embodiments of the invention are described in more detail hereinafter with reference to the accompanying drawings.
A switch fabric according to a first embodiment of the invention includes plural transmitting source nodes 100, a switch 200, and plural destination nodes 300.
The transmitting source nodes 100 each have virtual output queues (VOQ: Virtual Output Queue) per destination and per priority. In this case, two priority classes are provided: high priority QoS1 VOQs 110A to 113A, and low priority QoS0 VOQs 110B to 113B. The VOQs 110A to 113A and 110B to 113B have an independent credit on a destination-by-destination basis, regardless of priority, on a credit table 120. Herein, the credit refers to an available capacity of a receive-buffer of the switch 200, provided on a transmitting source-by-transmitting source basis and on a destination-by-destination basis. A VOQ holding a credit can transmit data to the switch 200.
Now, common credit management in the switch fabric is described.
As described above, the transmitting source node 100 can transmit data to a relevant destination as long as credits for that destination at the switch 200 remain. Every time data passes through the switch 200, the switch 200 returns a recovery credit to the transmitting source of the data, thereby restoring the credit. Further, the switch 200 needs a buffer region corresponding to at least the time (RTT: Round Trip Time) required from transmission of data until recovery of the credit by the transmitting source, and the transmitting source node 100 holds a number of credits corresponding to the available capacity of that buffer. When data flows smoothly, a status continues in which the credits of the transmitting source node 100 corresponding to the RTT are fully in use.
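To make this credit mechanism concrete, the following is a minimal sketch in Python; the class name, the method names, and the value 16 are illustrative assumptions, not part of the embodiment. It models the per-destination credit counter held by one transmitting source node: transmitting a data unit consumes a credit, and a recovery credit returned by the switch 200 restores it.

```python
class CreditCounter:
    """Per-destination credit counter held by one transmitting source node.

    Each credit corresponds to one unit of the switch's receive-buffer.
    The buffer, and hence the initial credit count, must cover at least one
    round trip time (RTT) worth of data, so that a smoothly flowing source
    is never stalled merely by credit recovery latency.
    """

    def __init__(self, rtt_credits: int):
        self.remaining = rtt_credits  # available receive-buffer capacity

    def can_send(self) -> bool:
        # A VOQ may transmit to this destination only while credits remain.
        return self.remaining > 0

    def on_send(self) -> None:
        # Transmitting one data unit consumes one credit.
        assert self.remaining > 0, "must not send without a credit"
        self.remaining -= 1

    def on_recovery_credit(self) -> None:
        # The switch returns a recovery credit each time a unit passes through.
        self.remaining += 1


# Usage: with an RTT-sized buffer, a lone smooth flow keeps all credits in flight.
credit = CreditCounter(rtt_credits=16)  # 16 is an illustrative value
while credit.can_send():
    credit.on_send()
print(credit.remaining)  # 0: the RTT window's worth of credits is in use
```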
Further, a state where congestion occurs in the switch fabric is described. When plural transmitting sources transmit data to the same destination, return of the recovery credits to each transmitting source is delayed, and the remained credits run short.
Now, a prioritized bandwidth management method for a switch fabric, according to the invention, is described hereinafter.
In the status 130, the data output bandwidths of the VOQs 119A and 119B are not restricted according to priority. For this reason, in the status 130, data of either the VOQ 119A or the VOQ 119B can be outputted without restrictions imposed on bandwidth.
In the status 140, the data output bandwidths of the VOQs 119A and 119B are restricted according to priority. More specifically, the data output bandwidth of the VOQ 119A is not restricted, while the data output bandwidth of the VOQ 119B is restricted. More generally, the data output bandwidth of the highest priority VOQ is not restricted, and the data output bandwidths of the other, lower priority VOQs are restricted.
Accordingly, with the invention, whether a certain destination in the switch fabric is in a congestion state, and whether the congestion state has been dissolved, is detected with the aid of the remained credits for the destination, thereby changing over between a status where the prioritized bandwidth management is enabled and a status where it is disabled. This method is described hereinafter.
First, the thresholds set on the credit table 120 for each destination are described: an RTT threshold 620, a congestion threshold 630, transmit inhibit thresholds 60X (X=0 to 3) on a priority-by-priority basis, and a congestion dissolved threshold 640.
The RTT threshold 620 refers to the number of credits corresponding to the data length transmittable during the interval from when the transmitting source node 100 transmits data to the switch 200 until when a recovery credit from the switch 200 reaches the transmitting source node 100. In the case of continued data transmission from only one transmitting source node 100 to a certain destination node 300, the number of the remained credits coincides with the RTT threshold 620.
The congestion threshold 630 is set at a value not higher than the RTT threshold 620.
When congestion occurs, setting of the transmit inhibit thresholds 60X (X=0 to 3) on a priority-by-priority basis is enabled. Data of each priority can be outputted only if the number of the remained credits exceeds the corresponding transmit inhibit threshold 60X (X=0 to 3). Assuming that a higher X value denotes a higher priority, the higher the X value, the smaller the transmit inhibit threshold 60X is rendered. At least for the highest priority (QoS3), transmission should remain possible until the credits are used up, so that the transmit inhibit threshold 603 is set at 0.
The congestion dissolved threshold 640 refers to a threshold at which the congestion state is assumed to be dissolved. If data transfer to a relevant destination is interrupted while return of the recovery credits from the switch 200 continues, the number of the remained credits will exceed the congestion dissolved threshold 640. At this point, setting of the transmit inhibit thresholds 60X (X=0 to 3) on the priority-by-priority basis is disabled. In general, the congestion dissolved threshold 640 is set at a value greater than any of the transmit inhibit thresholds 60X (X=0 to 3).
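The relationship among these thresholds can be illustrated with the following sketch. The concrete numbers are invented for illustration; only the ordering constraints come from the description above. The check reflects the mechanism described further below, whereby the remained credits are decreased by the priority's transmit inhibit threshold before the usual one-credit test.

```python
# Illustrative threshold values (in credits); only the ordering is prescribed:
# 0 = inhibit(QoS3) < inhibit(QoS2) < inhibit(QoS1) < inhibit(QoS0)
#   < congestion_dissolved(640), and congestion(630) <= rtt(620).
RTT_THRESHOLD_620        = 16
CONGESTION_THRESHOLD_630 = 12   # not higher than the RTT threshold
CONGESTION_DISSOLVED_640 = 14   # greater than every transmit inhibit threshold
TRANSMIT_INHIBIT_60X = {3: 0, 2: 2, 1: 4, 0: 6}  # higher priority, smaller threshold


def may_output(remaining_credits: int, priority: int, congested: bool) -> bool:
    """During congestion, subtract the priority's inhibit threshold before the
    usual "at least one credit" test; otherwise test the raw credit count."""
    inhibit = TRANSMIT_INHIBIT_60X[priority] if congested else 0
    return remaining_credits - inhibit > 0
```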
Now, the changeover operation of the prioritized bandwidth management according to the invention is described hereinafter.
If, in the middle of continued data transmission from a transmitting source to a relevant destination, another transmitting source also starts transmitting data to the same destination, return of credits to the transmitting source is delayed, so that the number of the remained credits falls short of the congestion threshold, as shown in a status 12. At this point, setting of the transmit inhibit thresholds 60X (X=0 to 3) on the priority-by-priority basis is enabled, and data output to the relevant destination is restricted according to priority.
If data output from the relevant transmitting source is interrupted while return of the recovery credits of the relevant destination to that transmitting source continues, the number of the remained credits exceeds the congestion dissolved threshold 640, whereupon setting of the transmit inhibit thresholds 60X (X=0 to 3) is disabled, and a status 14, in which output is not restricted according to priority, is restored.
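The changeover between the statuses described above amounts to a two-threshold hysteresis. A minimal sketch, reusing the constants of the previous sketch (the function name is an assumption):

```python
def next_congestion_state(congested: bool, remaining_credits: int) -> bool:
    """Hysteresis between the congestion threshold 630 and the congestion
    dissolved threshold 640: enable the prioritized bandwidth management when
    credits fall short of 630, disable it once credits exceed 640."""
    if not congested and remaining_credits < CONGESTION_THRESHOLD_630:
        return True    # status 12: per-priority restriction is enabled
    if congested and remaining_credits > CONGESTION_DISSOLVED_640:
        return False   # status 14: the restriction is dissolved
    return congested   # between the thresholds, the state is unchanged
```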
Now, a configuration of the transmitting source node 100 for output arbitration is described.
The transmitting source node 100 has a number of VOQs expressed by the product of the number of priorities and the number of destinations. VOQ arbiters 170 to 173 each gather output arbitration requests from the respective VOQs on a priority-by-priority basis, and select candidate VOQs on the basis of an algorithm such as round robin.
Subsequently, the one of the selected candidate VOQs having the highest priority is selected by a QoS arbiter 180. After the selection, the remained credits at the destination of the selected VOQ are checked by a remained credit checker 192. The remained credits are read from the credit table 120; if the prioritized bandwidth management for the relevant destination is enabled, checking is executed using a value obtained by decreasing the number of the remained credits by the transmit inhibit threshold corresponding to the priority. If the prioritized bandwidth management for the relevant destination is disabled, checking is executed using the value read from the credit table 120 as it is. When the remained credit checker 192 determines that a credit remains, the selected VOQ has won the output arbitration, and data output can be executed as long as credits remain. A credit is recovered upon return of a recovery credit from the switch (step 150).
Every time a winner VOQ outputs data, the number of the remained credits for the relevant destination on the credit table 120 is decreased (step 151), and the respective VOQ arbiters 170 to 173 update the selection state of their algorithm on the priority-by-priority basis, for example advancing the round robin in the case of round robin management (step 152). Further, a read pointer for the winner VOQ is updated to prepare for reading of subsequent data (step 153).
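The flow of steps 150 to 153 can be summarized as a behavioral sketch, reusing the CreditCounter and may_output helpers from the earlier sketches. Names such as VoqArbiter and arbitrate_once are illustrative and do not denote the actual circuits 170 to 192.

```python
from collections import deque

class VoqArbiter:
    """Per-priority round robin over the destinations that have a request."""
    def __init__(self, n_dest: int):
        self.order = deque(range(n_dest))

    def pick(self, requesting: set):
        for dest in self.order:
            if dest in requesting:
                return dest
        return None

    def advance(self, winner_dest: int) -> None:
        # Step 152: rotate so the winner becomes the lowest choice next round.
        while self.order[0] != winner_dest:
            self.order.rotate(-1)
        self.order.rotate(-1)


def arbitrate_once(voqs, arbiters, credits, congested):
    """voqs[priority][dest] is a deque of queued data units; arbiters maps
    priority -> VoqArbiter; credits maps dest -> CreditCounter."""
    # Each per-priority VOQ arbiter proposes one candidate destination.
    candidates = {}
    for prio, arb in arbiters.items():
        requesting = {d for d, q in enumerate(voqs[prio]) if q}
        dest = arb.pick(requesting)
        if dest is not None:
            candidates[prio] = dest
    if not candidates:
        return None
    # The QoS arbiter selects the highest-priority candidate (arbiter 180).
    prio = max(candidates)
    dest = candidates[prio]
    # Remained credit check (checker 192), per the rule in may_output().
    if not may_output(credits[dest].remaining, prio, congested[dest]):
        return None
    data = voqs[prio][dest].popleft()   # step 153: advance the read pointer
    credits[dest].on_send()             # step 151: decrease the remained credits
    arbiters[prio].advance(dest)        # step 152: update the round robin
    return data
```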
Now, the effect of the prioritized bandwidth management according to the invention is described.
In the case of non-congestion, data output from the VOQs is possible regardless of the priority of the data as long as at least one remained credit exists. That is, the management status is the same as that for QoS3, whose transmit inhibit threshold is 0.
In the case where the prioritized bandwidth management is disabled, as the input rate approaches 100%, the effective switching throughput, that is, the switch rate of data, keeps decreasing. On the other hand, in the case where the prioritized bandwidth management is enabled, high priority data can maintain an effective switching throughput substantially close to 100%, in other words, its switch rate can be maintained even if the input rate approaches 100%, provided that data with plural priorities mixed therein is inputted. To put it another way, the effective switching throughput of high priority data is enhanced by decreasing the effective switching throughput of low priority data. The method of changing over between enabling and disabling of the prioritized bandwidth management is as previously described.
The first embodiment of a method for executing the prioritized bandwidth management by changing over between its enabling and disabling, according to the invention, has been described in detail above. It is to be pointed out, however, that the present description is concerned with nothing but one embodiment of the invention, and that various changes and modifications may be made in the invention without departing from the spirit and scope thereof.
In the case of the first embodiment, transmission from the transmitting source node 100 is presumed to be unicast to a single destination node 300; however, even in the case of multi-cast to plural destination nodes 300, it is possible to execute similar prioritized bandwidth management, which is described hereinafter as a second embodiment of the invention.
When supporting the multi-cast, VOQs for exclusive use in the multi-cast, in a number corresponding to the number of priorities to be handled, are prepared in each transmitting source node 100 in addition to the VOQs according to the first embodiment, which are used for the unicast.
Processing is basically the same as that for the first embodiment; however, in the case of a transmitting source node 100 selecting multi-cast data, the remained credits for all destinations corresponding to the transmitting source node 100 are referred to on the credit table 120.
When prioritized bandwidth management is enabled for even one of the destinations corresponding to the transmitting source node 100, processing is executed on the assumption that the prioritized bandwidth management is enabled for all the destinations.
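A minimal sketch of this multicast check follows, under the assumption (one natural reading of the above) that a multicast data unit may be output only when the credit check passes for every destination of the group. The function name multicast_may_output is illustrative; may_output and the credit objects are those of the earlier sketches.

```python
def multicast_may_output(credits, congested, group, priority) -> bool:
    """group is the set of destination indices of the multicast data.

    credits maps dest -> CreditCounter, congested maps dest -> bool."""
    # If the prioritized bandwidth management is enabled for even one
    # destination of the group, treat it as enabled for all of them.
    any_congested = any(congested[d] for d in group)
    return all(may_output(credits[d].remaining, priority, any_congested)
               for d in group)
```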
It is to be pointed out that the present description is concerned with nothing but one embodiment of the invention, and that various changes and modifications may be made in the invention without departing from the spirit and scope thereof.
With the first and second embodiments, respectively, the remained credits of the switch 200 controlled by the transmitting source node 100 are presumed to indicate the available capacity of a receive-buffer provided independently on a transmitting source-by-transmitting source basis and on a destination-by-destination basis.
When a receive-buffer independent for every destination is provided for every transmitting source node 100 within the switch 200, even if a certain destination is in a congestion state, data transmission to other destinations is enabled without being subjected to any effect of the congestion. However, there arises a problem that the chip area of a switching device making up the switch 200 grows with the square of the number of ports. Methods for preventing the switch from growing in this manner include sharing the receive-buffers of the switch 200.
As a first method for sharing the receive-buffer of the switch 200, the receive-buffer may be shared by plural transmitting source nodes 100; as a second method, the receive-buffer may be shared by the plural destinations for every transmitting source node 100. With the first method, the available capacity of the receive-buffer, that is, the number of remained credits, changes according to the transmit states of other transmitting source nodes 100, rendering management complicated, so the first method is not preferable. Accordingly, a prioritized bandwidth management method for the switch fabric using the second method is described herein.
With the method whereby the receive-buffer of the switch 200 is shared by the plurality of the destinations for every transmitting source node 100, the remained credits on the credit table 120 are controlled as a capacity shared among the destinations. When a certain destination lacks remained credits and the number of the remained credits falls short of the congestion threshold 630, the prioritized bandwidth management is enabled in the same manner as in the first embodiment.
The present embodiment is suitable for application particularly in the case of making use of the switch 200 with multiple ports.
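One possible realization of this sharing is sketched below, under the assumption that the per-source receive-buffer is tracked as a single credit pool from which every destination draws; the class and its fields are illustrative, and the congestion test reuses the threshold constant of the earlier sketch.

```python
class SharedCreditPool:
    """Receive-buffer of the switch shared by all destinations of one source.

    One pool-wide counter replaces the per-destination counters, so the
    switch-side buffer no longer grows with the square of the port count.
    """

    def __init__(self, total_credits: int):
        self.remaining = total_credits

    def congested(self) -> bool:
        # Depletion of the shared pool signals congestion for this source,
        # enabling the per-priority restriction for all its destinations.
        return self.remaining < CONGESTION_THRESHOLD_630

    def on_send(self) -> None:
        self.remaining -= 1

    def on_recovery_credit(self) -> None:
        self.remaining += 1
```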
Further, it is to be pointed out that the present description is concerned with nothing but one embodiment of the invention, and that various changes and modifications may be made in the invention without departing from the spirit and scope thereof.
Lately, many network transfer units such as routers and switches (L2 switches, L3 switches, etc.) make use of Ethernet frames (hereinafter called packets) of a variable length as transfer data, and a network transfer unit having a switch fabric often divides a packet into cells of a fixed length before transfer. That is, one input data unit to the switch fabric comes to be made up of plural data units. Accordingly, as a fourth embodiment, there is shown a method of applying the invention to the first to third embodiments, respectively, in the case of the data to be handled being packets.
With the fourth embodiment, the VOQs of the transmitting source node 100 hold packets, and a packet taken out of a VOQ is divided into cells of the fixed length, which are then transferred to the switch 200.
In this case, in the middle of a certain packet being divided into cells and transferred to the switch 200, it can happen that the number of remained credits for the relevant destination falls short of the congestion threshold 630, so that the prioritized bandwidth management for the destination is enabled in mid-packet.
Further, all the cells of the packets successfully taken out of the VOQs while the prioritized bandwidth management for the relevant destination is enabled may be transmitted without the restrictions of the transmit inhibit thresholds 60X (X=0 to 3) according to QoSX. Alternatively, for such packets, transmission of those cells subject to the restrictions of the transmit inhibit thresholds 60X (X=0 to 3) according to QoSX may be suspended, and only the cells not subject to the restrictions may be transmitted.
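The two policies can be sketched as follows. The function transmit_packet and its parameters are illustrative; may_output and the credit object are those of the earlier sketches, and the actual cell transmission is left as a comment.

```python
def transmit_packet(packet: bytes, cell_size: int, priority: int,
                    credit, congested_flag, finish_packet: bool):
    """Divide a variable-length packet into fixed-length cells and transmit.

    finish_packet=True : all cells of a packet already taken out of its VOQ
                         are sent without the per-priority inhibit threshold
                         (only the basic one-credit check still applies).
    finish_packet=False: cells blocked by the inhibit threshold are held back
                         until the credits recover.
    """
    cells = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]
    held_back = []
    for cell in cells:
        apply_inhibit = (not finish_packet) and congested_flag()
        if not may_output(credit.remaining, priority, congested=apply_inhibit):
            held_back.append(cell)   # suspended; sent once credits recover
            continue
        credit.on_send()
        # send_cell(cell) would go here; the destination node reassembles
        # the original variable-length packet from the fixed-length cells.
    return held_back
```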
Still further, it is to be pointed out that the present description is concerned with nothing but one embodiment of the invention, and that various changes and modifications may be made in the invention without departing from the spirit and scope thereof.
The first to fourth embodiments have been described on the premise that the switch 200 of the switch fabric is a single-stage switch. However, in order to significantly increase the number of ports to be handled, it is necessary to make up a multi-stage connecting net of not less than three stages, such as a Clos network or a Benes network, using plural switching devices. Even in such cases, the same prioritized bandwidth management as shown in the first to fourth embodiments can be implemented, and the points to be modified for this purpose are described hereinafter as a fifth embodiment of the invention.
With the fifth embodiment, the remained credits of the switch 200 handled by the transmitting source node 100 indicate the available buffer-capacity of the switching device positioned in the stage closest to the transmitting source node 100. It is unnecessary for the transmitting source node 100 to control the available buffer-capacity of switching devices in the second stage and onwards; the remained credits of a switching device in an N-th stage (N is an integer not less than 2) are generally controlled by the switching device in the (N-1)-th stage. Enabling and disabling of the prioritized bandwidth management, and the prioritized bandwidth management method of the transmitting source node 100 in the respective statuses, may be executed as in the first embodiment. More specifically, in a switching system making up a multi-stage connecting net with plural switching devices, the number of the remained credits is controlled, on a destination-by-destination basis, as the available buffer-capacity of the switching device positioned in the stage closest to the transmitting source node 100 within the multi-stage connecting net, and on the basis of this information only, changeover between enabling and disabling of the prioritized bandwidth management is executed.
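A minimal sketch of this chained credit control, assuming an illustrative StageSwitch class: the transmitting source node keeps credits only for its first-stage device, and each device keeps the credits of the next-stage buffers it feeds.

```python
class StageSwitch:
    """One switching device in a multi-stage connecting net.

    Per the fifth embodiment, the transmitting source node tracks only the
    credits of the first-stage device; a device in stage N-1 tracks the
    remained credits (buffer space) of the stage-N devices it feeds.
    """

    def __init__(self, name: str, downstream_credits: dict):
        self.name = name
        # next-stage output port -> remained credits of that device's buffer
        self.credits = dict(downstream_credits)

    def can_forward(self, port) -> bool:
        return self.credits[port] > 0

    def on_forward(self, port) -> None:
        self.credits[port] -= 1

    def on_recovery_credit(self, port) -> None:
        self.credits[port] += 1
```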
Yet further, it is to be pointed out that the present description is concerned with nothing but one embodiment of the invention, and that various changes and modifications may be made in the invention without departing from the spirit and scope thereof.
The prioritized bandwidth management method according to the invention can be used in systems requiring data switching over large capacity lines. By way of example, it is conceivable to make use of the prioritized bandwidth management method according to the invention for the switch fabric in network transfer units represented by routers and switches, and for the switch fabric in units such as servers and storage devices.