1. Field of the Invention
This invention relates to minimizing data overflow in communications links. More specifically, the invention relates to arrangements for minimizing data overflow by managing data buffer occupancy, especially in a Fibre Channel (FC) environment.
2. Related Art
Fibre Channel (FC) technology is well known in the art. See, for example, Chapter 13.5 (“Fibre Channel”) of William Stallings' Data and Computer Communications (Prentice-Hall, 1997), which, like all documents cited herein, is incorporated by reference.
Fibre Channel (FC) provides a credit based flow control to protect against collisions and assure that the receiving port is not flooded with more data than it can handle. This approach avoids overrun and also provides a way to mitigate performance degradation over distance by allowing more in-flight frames. However, Fibre Channel has a detrimental impact on performance if extended distances separate source and destination devices.
Storage area network (SAN) users have deployed Fibre Channel Extender devices to minimize the performance degradation over extended distances. However, such devices, if not properly designed, can actually have adverse impact on system performance. For example, under various realistic conditions, channel extenders drop packets.
One approach involved use of a supplemental overflow data channel. For example, U.S. Patent Application Publication No. 2001/0024432 (Zehavi et al.) discloses an arrangement in which, when a data rate of a packet exceeds a capacity of a main channel, the packet is also transmitted on an overflow channel.
Other approaches have involved complex, distributed management schemes. For example, U.S. Patent Application Publication No. 2003/0065736 (Pathak et al.) discloses an arrangement in which nodes in a wireless data network keep track of an amount of memory that is reported to be available in a client device, so that a network essentially ensures that overflow does not occur in the client devices.
Another approach involves frame pull flow control in which frames remain in a first Fibre Channel device until they are requested by a second Fibre Channel device; see U.S. Patent Application Publication No. 2003/0202474 (Kreuzenstein et al.).
One approach to extending Fibre Channel performance range is disclosed in U.S. Patent Application Publication No. 2003/0227874 (Wang), which involves a supplemental buffer arrangement governed by a locally generated ready signal. The locally generated signal is substituted for the ready signal that would be remotely generated according to the Fibre Channel standard. Wang's transmitting node keeps a count of the remote buffer usage and stops sending frames if the remote buffer is full. This count is incremented when the transmitting node sends a frame to the remote node and is decremented when it receives an R_RDY (receiver ready) signal. Undesirably, such arrangements suffer performance degradation if the buffer at the remote node is less than a certain size, often owing to the effects of latency (round-trip communication delay) when awaiting R_RDY signals. Such performance degradation can persist even if there is no data rate mismatch. Most Fibre Channel extenders, including the one disclosed by Wang, perform optimally if they operate within design parameters. However, as latency is increased beyond design values, performance decreases and usable bandwidth is wasted.
Accordingly, there is a need in the art for arrangements that adapt to increased latency or network impairment and still provide an optimal performance. Also, there is a need in the art for an arrangement that intelligently and transparently minimizes or eliminates data overflow, even over long distances and using Fibre Channel technology, thus allowing fulfillment of quality of service (QoS) guarantees. There is also a need in the art for an approach that minimizes dropped traffic to an insignificant amount, and, further, that is generic enough to adapt to all data rates and distances between the source and destination devices.
A method manages a communications link in which a first device interfaces with a first channel extender and in which a second device interfaces with a second channel extender, and in which the first and second channel extenders communicate with each other through a communications medium. The method involves monitoring an occupancy of a receive buffer in the first channel extender during transmission from the second channel extender at a first transmission rate; at a first time when the occupancy of the receive buffer exceeds a first threshold, immediately instructing the second channel extender to cease transmission to the first channel extender; monitoring for an overflow condition in the receive buffer; if the overflow condition is present, specifying that a future transmission to the first channel extender be at a second transmission rate that is lower than the first transmission rate; and instructing the second channel extender to resume transmission to the first channel extender.
A more complete appreciation of the described embodiments is better understood by reference to the following Detailed Description considered in connection with the accompanying drawings, in which like reference numerals refer to identical or corresponding parts throughout, and in which:
In describing embodiments illustrated in the drawings, specific terminology is employed for the sake of clarity. However, the invention is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents that operate in a similar manner to accomplish a similar purpose. Various terms that are used in this specification are to be given their broadest reasonable interpretation when used to interpret the claims.
Moreover, features and procedures whose implementations are well known to those skilled in the art are omitted for brevity. For example, initiation and termination of loops, and the corresponding incrementing and testing of loop variables, may be only briefly mentioned or illustrated, their details being easily surmised by skilled artisans. Thus, the steps involved in methods described herein may be readily implemented by those skilled in the art without undue experimentation.
Further, various aspects, features and embodiments may be described as a process that can be depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel, concurrently, or in a different order than that described. Operations not needed or desired for a particular implementation may be omitted. A process or steps thereof may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so forth, or any combination thereof.
Fibre Channel (FC) provides a credit based flow control to protect against collisions and to assure that the receiving port is not flooded with more data than it can handle. This arrangement not only avoids overruns but also provides a way to mitigate performance degradation over distance by allowing more in-flight frames. A port is designed with a fixed amount of storage capacity, to accept a given number of frames. On power up, each port logs into a fabric, if present, or to an N-port. Several parameters are exchanged, including the amount of data that can be transferred without receiving an acknowledgement. Flow control uses a buffer-to-buffer (B2B) credit mechanism. Storage capacity is referred to in terms of buffer-credits, and acknowledgement is referred to as R_RDY (receiver ready). A port sends only the number of frames equal to the buffer-to-buffer credit of the receiving port.
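By way of illustration only, the buffer-to-buffer credit mechanism described above can be sketched as a simple counter on the sending port. The class and method names (CreditedSender, send_frame, on_r_rdy) are hypothetical and are not taken from the Fibre Channel standard; the sketch assumes one credit is consumed per frame and one is returned per R_RDY.

```python
class CreditedSender:
    """Sender side of buffer-to-buffer credit flow control (illustrative only)."""

    def __init__(self, bb_credit: int) -> None:
        if bb_credit < 1:
            raise ValueError("bb_credit must be at least 1")
        self.bb_credit = bb_credit   # credit advertised by the receiving port at login
        self.available = bb_credit   # credits not yet consumed by in-flight frames

    def can_send(self) -> bool:
        return self.available > 0

    def send_frame(self) -> None:
        """Consume one credit for each frame placed on the link."""
        if not self.can_send():
            raise RuntimeError("no buffer-to-buffer credit left; wait for R_RDY")
        self.available -= 1

    def on_r_rdy(self) -> None:
        """Each R_RDY from the receiving port returns one credit."""
        if self.available < self.bb_credit:
            self.available += 1


# A port with three credits stalls after three frames until an R_RDY arrives.
sender = CreditedSender(bb_credit=3)
sent = 0
while sender.can_send():
    sender.send_frame()
    sent += 1
sender.on_r_rdy()
```

After the loop the sender is idle; the single R_RDY restores exactly one credit, which is why a long round trip leaves the link unused between bursts.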
Fibre Channel performance is known to degrade over distance. As the distance between the FC devices increases beyond a certain point, system performance deteriorates because the “pipe” between the devices is not completely full. That is, the sending FC device is idle between two transmission cycles.
This scenario is shown in
In response to a read request, disk array 120 sends N frames and waits for the first R_RDY from server 110. The first frame (“Frame 0”) is transmitted at time τ0. The Nth frame (“Frame N−1”) is transmitted at time τ1. Due to extended distance between the server and disk array, the first R_RDY is not received until time τ2. The disk array is idle for a time period equal to τ2−τ1.
According to one embodiment, to minimize performance degradation, the B2B credit of the receiving device meets the following relationship:
in which:
N=buffer-to-buffer credit of the receiving FC port
rf=FC line rate (in one example, 1.0625·10^9 or 2.125·10^9 bits/s)
sf=FC frame size (in one example, 36+2148=2184 bytes)
τlat is latency and is explained as follows.
For purposes of the present discussion, latency is defined as the total elapsed time for one data bit to travel from source to the destination. It is a function of distance between source and destination as well as the characteristics of the communication equipment. Latency can be expressed as:
in which:
In a real system the processing and buffering time are negligible in comparison to the latency due to the distance between the devices and, therefore, may be ignored in the following discussion.
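As a rough, non-authoritative illustration of how distance, latency, and buffer-to-buffer credit relate, the following sketch applies the usual bandwidth-delay argument: enough credits are needed to keep frames flowing for one round trip. It is not asserted to be the relationship given above. The propagation figure of about 5 microseconds per kilometre and the factor of 10 line bits per byte (8b/10b encoding) are assumptions, as are all function names.

```python
import math

FIBRE_PROPAGATION_S_PER_KM = 5e-6  # assumption: roughly 5 microseconds per km in fibre
LINE_BITS_PER_BYTE = 10            # assumption: 8b/10b encoding on the FC link


def one_way_latency(distance_km: float) -> float:
    """Propagation-only latency; processing and buffering time are ignored."""
    return distance_km * FIBRE_PROPAGATION_S_PER_KM


def frame_time(frame_bytes: int, line_rate_bps: float) -> float:
    """Time needed to serialize one frame onto the link."""
    return frame_bytes * LINE_BITS_PER_BYTE / line_rate_bps


def credits_to_fill_pipe(distance_km: float,
                         frame_bytes: int = 2184,
                         line_rate_bps: float = 1.0625e9) -> int:
    """Smallest credit count that covers one round trip of full-size frames."""
    round_trip = 2 * one_way_latency(distance_km)
    return math.ceil(round_trip / frame_time(frame_bytes, line_rate_bps))


n_100km = credits_to_fill_pipe(100)
n_200km = credits_to_fill_pipe(200)
```

Under these assumptions the required credit grows linearly with distance, which is why a port with a fixed, small credit idles on long links.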
Briefly, channel extenders 310, 320 minimize performance degradation by sending an acknowledgement to the Fibre Channel devices 110, 120 even though the data may not have reached the destination. This function, called “spoofing,” tricks the Fibre Channel transmitting device into thinking that acknowledgement was sent by the receiving device.
Some embodiments of channel extenders include buffer memory to temporarily store the overflow data. One embodiment actually includes two buffers, namely a credit buffer (CrB) and an overflow buffer (OvB).
In summary, certain embodiments of channel extenders perform some or all of the following functions:
In steady state operation, incoming data is stored in the CrB using a first-in-first-out (FIFO) method. That is, data is stored on the top of the CrB and read from the bottom.
The times and function values at times shown in
Under normal operating conditions the transmit and receive data rates are equal (Ri=Ro), and CrB contains a small number B0 of bytes (b=B0). Here, ri=Ri and ro=Ro are the input and output data rates, respectively.
One embodiment of the method continuously monitors buffer occupancy b. Control remains in a steady state loop for so long as the output and input data rates are essentially matched (see
A scenario for flow control invocation is now briefly described. At time T0 (500 in
b=B0+(Ri−R′o)(t−T0)
If the drain rate stays at R′o, the CrB will eventually fill up at time tc:
The channel extender invokes flow control by sending a signal to the source device to stop data transmission so that the CrB can be emptied and to avoid buffer overflow.
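The occupancy expression above, and the fill time that follows from it, can be sketched as below. This is an illustrative derivation only: it assumes the CrB counts as full when b = B0 + Bcr, with Bcr taken to be the credit buffer capacity above the steady-state level, so that tc = T0 + Bcr/(Ri − R′o). The function and parameter names are hypothetical.

```python
def occupancy(t, t0, b0, r_in, r_out_reduced):
    """Credit-buffer occupancy b(t) after the output rate drops at time t0."""
    return b0 + (r_in - r_out_reduced) * (t - t0)


def fill_time(t0, b_cr, r_in, r_out_reduced):
    """Time at which occupancy reaches b0 + b_cr, or None if it never does."""
    surplus = r_in - r_out_reduced
    if surplus <= 0:
        return None  # the buffer is not filling, so there is no fill time
    return t0 + b_cr / surplus


# Example: 96 MB/s arriving, 80 MB/s draining, 2 MB of credit buffer headroom.
tc = fill_time(t0=0.0, b_cr=2_000_000, r_in=96e6, r_out_reduced=80e6)
b_at_tc = occupancy(tc, t0=0.0, b0=1000, r_in=96e6, r_out_reduced=80e6)
```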
The management of data during the time after flow control is invoked (
The excess-in-flight data, Ri−R′0, is saved in the OvB. OvB size is a function of the distance between source and destination devices and their data rate. The optimum buffer size for a given distance and data rate can be expressed as:
Bovf=(Ri−R′0)·2·τlat
If OvB is less than the optimum size shown above, some of the data will be lost (see
The times and function values at times shown in
Before a more detailed description of the embodiment is presented, flow control termination is now briefly introduced (see
The embodiment to be described is better understood after appreciating the following observations.
The number of dropped frames is a function of various factors, including:
Normally, channel extenders' buffer capacity satisfies the criteria expressed in Equations 1 and 2, above. Satisfying such criteria guarantees error free operation. However, as mentioned above, there are numerous applications where the buffer capacity does not satisfy the criteria, and data rate imbalance may result in loss of data.
The inventors have recognized that channel extension technology should recover from a hardware fault that causes lossy transmission. The following embodiment detects a data rate mismatch that would lead to data loss, and takes appropriate measures to slow down the source data rate so that it matches the destination data rate. According to this embodiment, service outage intervals are reduced to the millisecond range.
The method described here manages data buffer occupancy in such a manner that, in case of a data rate mismatch, data overflow is minimized. Some features of the embodiment include:
In one embodiment, the operations summarized in
Referring to
In some embodiments, the Fibre Channel extenders compute round trip latency by exchanging time stamped messages. This approach is simple, accurate, and easy to implement. One embodiment for determining the round trip latency includes:
The transmitting node, after receiving and validating the message, retrieves the time stamp from the message. Assuming wc2 represents the wall clock when the message was received, the round trip latency 2τlat is computed from the difference between wc2 and the retrieved time stamp.
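A minimal sketch of such a time-stamp exchange follows. The message layout, the field name wc1 for the sender's time stamp, and the function names are assumptions made for illustration; the sketch further assumes the remote extender echoes the message without modifying the stamp.

```python
import time


def make_probe(clock=time.monotonic):
    """Build a probe message carrying the sender's clock reading (wc1)."""
    return {"type": "latency_probe", "wc1": clock()}


def echo_probe(message):
    """Remote extender: return the probe with its time stamp preserved."""
    return dict(message)


def round_trip_latency(echoed, clock=time.monotonic):
    """Sender: subtract the retrieved time stamp from the receive time wc2."""
    wc2 = clock()
    return wc2 - echoed["wc1"]


# Deterministic example using a fake clock that advances by 4 ms.
ticks = iter([10.000, 10.004])
fake_clock = lambda: next(ticks)
probe = make_probe(fake_clock)
rtt = round_trip_latency(echo_probe(probe), fake_clock)
```

Because both readings come from the same clock on the transmitting node, the two extenders do not need synchronized clocks.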
Referring to
As a background to understanding
The times and function values at times shown in
Under normal operating conditions the transmit and receive data rates are equal (Ri=Ro), and CrB contains a small number B0 of bytes (b=B0). Here, ri=Ri and ro=Ro are the input and output data rates, respectively.
In
However, if there is a mismatch in the data rate, then ri>ro and b>B0. Accordingly, steady state operation (
The data rate imbalance (Ri>R′o) immediately after time 800 causes excess data to “pile up” in credit buffer CrB. Immediately after time 800 (T0), buffer occupancy can be expressed as:
b=B0+(Ri−R′o)(t−T0)
During this phase, buffer monitoring logic or program steps invoke the flow control mechanism to send a message to the source node to stop data transmission until further notice. This mechanism is invoked when the CrB is full as determined when b=B0+Bcr.
More specifically, referring again to
Each channel extender has the ability to implement a TXcnt counter, an RXcnt counter, and registers as needed. TXcnt contains a count of the number of data bytes transmitted since it was reset. TXcnt is incremented when b>B0 and a buffer read operation is performed to transmit a data byte. RXcnt contains a count of the number of data bytes received since it was reset. RXcnt is incremented when b>B0 and a buffer write operation is performed, when a data byte is received and stored.
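The counter behavior just described might be modeled, purely as a sketch, by a FIFO with two conditional counters. The class name CreditBuffer is hypothetical, and one detail is an assumption: the b>B0 test is applied after a byte is stored on a write and before a byte is removed on a read.

```python
from collections import deque


class CreditBuffer:
    """FIFO credit buffer with TXcnt and RXcnt counters (illustrative only)."""

    def __init__(self, b0: int) -> None:
        self.b0 = b0         # steady-state occupancy B0, in bytes
        self.fifo = deque()
        self.tx_cnt = 0      # TXcnt: bytes read out while b > B0
        self.rx_cnt = 0      # RXcnt: bytes written while b > B0

    @property
    def b(self) -> int:
        """Current buffer occupancy in bytes."""
        return len(self.fifo)

    def reset_counters(self) -> None:
        self.tx_cnt = 0
        self.rx_cnt = 0

    def write(self, byte: int) -> None:
        """Store a received byte; count it only if occupancy now exceeds B0."""
        self.fifo.append(byte)
        if self.b > self.b0:
            self.rx_cnt += 1

    def read(self) -> int:
        """Remove the oldest byte for transmission; count it only if b > B0."""
        if self.b > self.b0:
            self.tx_cnt += 1
        return self.fifo.popleft()


crb = CreditBuffer(b0=2)
for value in range(5):
    crb.write(value)
first_out = [crb.read() for _ in range(4)]
```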
Expressed in words rather than in flowchart form, one embodiment of flow control invocation involves:
The flow control message (
Upon receiving the flow control message, the source device immediately ceases data transmission. Due to latency, the last bit of in-flight data arrives at the destination channel extender at a time T2=Tlwm+2τlat (
Excess-in-flight data is saved in overflow buffer OvB (see
Lossless transmission occurs if overflow buffer OvB 924 is sufficiently large to store the excess in-flight data. Two cases can arise:
In contrast, lossy transmission occurs if the overflow buffer OvB 924 is not sufficiently large to accommodate the excess in-flight data. Lossy transmission thus occurs if Bovf<(Ri−R′0)·2·τlat.
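The lossless and lossy conditions can be checked with a few lines of arithmetic. The sketch below simply evaluates the stated inequality; the function names are hypothetical and the rates are assumed to be expressed in bytes per second.

```python
def excess_in_flight(r_in, r_out_reduced, tau_lat):
    """Bytes that pile up during one round trip: (Ri - R'0) * 2 * tau_lat."""
    return max(0.0, (r_in - r_out_reduced) * 2 * tau_lat)


def is_lossless(b_ovf, r_in, r_out_reduced, tau_lat):
    """True if an overflow buffer of b_ovf bytes can absorb the excess data."""
    return b_ovf >= excess_in_flight(r_in, r_out_reduced, tau_lat)


# Example: 20 MB/s of surplus over a 1 ms round trip is about 20,000 bytes.
fits = is_lossless(32768, r_in=100e6, r_out_reduced=80e6, tau_lat=0.0005)
drops = not is_lossless(16384, r_in=100e6, r_out_reduced=80e6, tau_lat=0.0005)
```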
One embodiment of the in-flight data management involves the following steps, expressed in textual format:
Block 730 indicates a preliminary step of resetting an overflow (“O.F.”) flag to its inactive state (by one convention, inactive state is “0”). The flag remains reset until an overflow condition is detected.
Decision block 732 indicates the ongoing monitoring for an overflow condition.
If there is no overflow condition, then control passes directly to decision block 736. However, if an overflow condition is detected, then an overflow flag is set in block 734 before control passes to decision block 736.
Decision block 736 compares the value of the wall clock wc to T′2-, which is a small time ε before T2. If wc≦T2 then the final byte from the sender has not had time to arrive and control passes back to block 732 for continued monitoring for an overflow condition. However, if wc>T2 then control passes out of loop 732-736 to decision block 738. At this time, the overflow flag is either set or not set, depending on whether an overflow condition was detected in block 732.
Block 738 analyzes the rate of change of buffer occupancy b with respect to time. The rate of change may be expressed as a first derivative of b, that is, as db/dt. If the rate of change of buffer occupancy is essentially zero, then buffer occupancy is not being reduced. Accordingly, control remains within the loop 738 to continue to monitor for any change in buffer occupancy.
However, a buffer occupancy change db/dt becoming non-zero indicates that buffer occupancy is being reduced. The reduction in buffer occupancy b derives from the sending node's stopping transmission at time T1 which is reflected at the receiving channel extender ε before time T2. Control passes to blocks 740 and 742. Block 740 indicates immediately saving a copy of the wall clock wc at time T2 (
Briefly, if there is an overflow, the transmitting device is instructed to restrict the data transmission rate to a new, presumably smaller rate R′0. In the absence of overflow (when OvB is large enough to absorb excess data bytes), the source device is directed to resume data transmission at the same rate R0 that was in effect when it was directed to stop data transmission.
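The in-flight monitoring loop and the subsequent choice of resume rate may be sketched as follows. This is one possible reading of blocks 730 through 738, not a definitive implementation: it assumes occupancy is sampled at discrete times, and the sample format (wc, b, overflowed) and the function names are hypothetical.

```python
def in_flight_phase(samples, t2):
    """Walk (wc, b, overflowed) samples taken after the stop message is sent.

    Returns (overflow_flag, t_drain), where t_drain is the first sample time
    after t2 at which occupancy has started to fall, or None if it never does.
    """
    overflow_flag = False      # block 730: flag starts in its inactive state
    previous_b = None
    for wc, b, overflowed in samples:
        if wc <= t2:
            # loop 732-736: in-flight data may still be arriving
            overflow_flag = overflow_flag or overflowed   # block 734
        elif previous_b is not None and b < previous_b:
            # block 738: the rate of change db/dt has become negative
            return overflow_flag, wc
        previous_b = b
    return overflow_flag, None


def resume_rate(overflow_flag, r0, r0_reduced):
    """Resume at the reduced rate only if an overflow was observed."""
    return r0_reduced if overflow_flag else r0


samples = [(0.0, 100, False), (1.0, 150, True), (2.0, 150, False),
           (3.0, 150, False), (4.0, 140, False)]
flag, t_drain = in_flight_phase(samples, t2=2.0)
rate = resume_rate(flag, r0=100e6, r0_reduced=80e6)
```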
Two conditions for a lossless transmission are:
Accordingly, indications of lossy transmission conditions include:
One embodiment of flow control termination involves the following steps, expressed in textual format.
Decision block 750 indicates the ongoing comparison of b and Bp−Bcr. For as long as b>Bp−Bcr control remains within the monitoring loop including decision block 750. However, when buffer occupancy b is reduced so that b≦Bp−Bcr then control passes to block 752. Block 752 indicates the definition of time Thwm to the value of the wall clock wc at that instant. Thwm is
Decision block 756 continually monitors buffer occupancy b for so long as b>2τlatR′0. This test determines when the buffer has drained sufficiently so that it can be emptied in a round trip delay interval 2τlat. When b finally decreases to a point at which b≦2τlatR′0 then control passes to decision block 758.
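The test of decision block 756 reduces to a single comparison, sketched below with a hypothetical function name. The intuition, offered as an interpretation, is that a resume message needs one latency period to reach the sender and the first new byte needs another to arrive, so whatever remains in the buffer has about 2τlat in which to drain at rate R′0.

```python
def drained_enough(b, tau_lat, r_out_reduced):
    """True once the buffer can be emptied within one round trip: b <= 2*tau_lat*R'0."""
    return b <= 2 * tau_lat * r_out_reduced


# With a 2 ms round trip and an 80 MB/s drain rate the threshold is near 160,000 bytes.
still_waiting = not drained_enough(200_000, tau_lat=0.001, r_out_reduced=80e6)
may_resume = drained_enough(150_000, tau_lat=0.001, r_out_reduced=80e6)
```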
Decision block 758 examines the overflow (“O.F.”) flag. If the overflow flag has been set (see
Channel extenders may be embodied by any suitable systems for performing the methods described herein, the systems including at least one data processing element. Generally, these data processing elements may be implemented as any appropriate computer(s) employing technology known by those skilled in the art to be appropriate to the functions performed. The computer(s) may be implemented using a conventional general purpose computer programmed according to the foregoing teachings, as will be apparent to those skilled in the computer art. Appropriate software can readily be prepared by programmers based on the teachings of the present disclosure. Suitable programming languages operating with available operating systems may be chosen.
General purpose computers may implement the foregoing methods, in which the computer housing may house a CPU (central processing unit), memory such as DRAM (dynamic random access memory), ROM (read only memory), EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), SRAM (static random access memory), SDRAM (synchronous dynamic random access memory), and Flash RAM (random access memory), and other special purpose logic devices such as ASICs (application specific integrated circuits) or configurable logic devices such as GAL (generic array logic) and reprogrammable FPGAs (field programmable gate arrays).
Each computer may also include plural input devices (for example, keyboard, microphone, and mouse), and a display controller for controlling a monitor. Additionally, the computer may include a floppy disk drive; other removable media devices (for example, compact disc, tape, and removable magneto optical media); and a hard disk or other fixed high-density media drives, connected using an appropriate device bus such as a SCSI (small computer system interface) bus, an Enhanced IDE (integrated drive electronics) bus, or an Ultra DMA (direct memory access) bus. The computer may also include a compact disc reader, a compact disc reader/writer unit, or a compact disc jukebox, which may be connected to the same device bus or to another device bus.
In
Various other elements are shown connected to processor 900 by busses 910. For example, a set of counters 920 is accessible to processor 900. In the embodiments discussed above, such counters include TXcnt, RXcnt, and OvFcnt.
The channel extender also includes memory of various kinds suitable for various purposes. For purposes of illustration,
As is readily understood by those skilled in the art, busses 910 include address, data, and control lines that are generally under control of processor 900. Element 910 is understood to encompass plural busses, including direct memory access busses, special purpose busses, and the like, that may be chosen for a particular application. Those skilled in the art readily understand that elements connected to busses 910 may also be partially or completely incorporated within processor 900 even though they are separately illustrated.
The invention envisions at least one computer readable medium. Examples of computer readable media include compact discs, hard disks, floppy disks, tape, magneto optical disks, PROMs (for example, EPROM, EEPROM, Flash EPROM), DRAM, SRAM, SDRAM. Stored on any one or on a combination of computer readable media is software for controlling both the hardware of the computer and for enabling the computer to interact with other elements, to perform the functions described above. Such software may include, but is not limited to, user applications, device drivers, operating systems, development tools, and so forth. Such computer readable media further include a computer program product including computer executable code or computer executable instructions that, when executed, causes a computer to perform the methods disclosed above. The computer code may be any interpreted or executable code, including but not limited to scripts, interpreters, dynamic link libraries, Java classes, complete executable programs, and the like.
From the foregoing, it will be apparent to those skilled in the art that a variety of methods, systems, computer programs on recording media, and the like, are provided.
The present disclosure supports a method of managing a communications link in which a first device (110) interfaces with a first channel extender (310) and in which a second device (120) interfaces with a second channel extender (320), and in which the first and second channel extenders (310, 320) communicate with each other through a communications medium (330). The method may involve (
The first transmission rate (R0) may constitute a first steady-state transmission rate, and the second transmission rate (R′0) may constitute a lower, second steady-state transmission rate that guards against future overflow conditions in the receive buffer (922+924).
The method may further involve causing the first and second channel extenders (310, 320) to communicate with respective first and second devices (110, 120) using a Fibre Channel standard, so that the first and second devices (110, 120) can communicate with each other based on only the Fibre Channel standard; and causing the first and second channel extenders (310, 320) to generate a false R_RDY signal (
The first channel extender (310) may carry out the method with a receive buffer (922+924) that is sized large enough to have a buffer-to-buffer credit N defined by:
wherein rf is a Fibre Channel line rate; sf is a Fibre Channel frame size; and τlat is latency constituting a total elapsed time for one data bit to travel between the two channel extenders.
The method may further involve waiting (736) a round trip latency period (2τlat) after the step of immediately instructing (718) the second channel extender (320) to cease transmission, before a second time (T2) of performing a step (738) of monitoring a rate of change (db/dt) of the receive buffer's occupancy (b).
The method may further involve, after the rate of change (db/dt) of the receive buffer's occupancy (b) becomes negative, monitoring (750) the receive buffer's occupancy (b) for a third time (Thwm) at which the receive buffer's occupancy (b) reaches a second threshold (750).
The method may further involve, after the step (750) of monitoring the receive buffer's occupancy (b) for the third time (Thwm) at which the receive buffer's occupancy (b) reaches the second threshold (750), calculating the second transmission rate (R′0) as:
wherein TXcnt is an amount of data emptied from the receive buffer since an initial data rate imbalance began, and after a minimum number (Bcr) of bytes have been emptied from the receive buffer since the rate of change (db/dt) of the receive buffer's occupancy (b) became negative; Thwm is the third time, at which the receive buffer's occupancy (b) reached a second threshold (750); and T0 is the first time (T0), at which the occupancy of the receive buffer (922+924) exceeded the first threshold (B0+Bcr).
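The expression for the second transmission rate is not reproduced in this text. The following sketch assumes, from the variable definitions alone, that the rate is the quotient of the data drained (TXcnt) and the elapsed time (Thwm − T0); that form is an inference and the function name is hypothetical.

```python
def estimate_second_rate(tx_cnt, t_hwm, t0):
    """Assumed form: measured drain rate = TXcnt / (Thwm - T0)."""
    if t_hwm <= t0:
        raise ValueError("t_hwm must be later than t0")
    return tx_cnt / (t_hwm - t0)


# Example: 8,000,000 bytes drained over 0.125 s gives the measured drain rate.
r0_reduced = estimate_second_rate(tx_cnt=8_000_000, t_hwm=1.125, t0=1.0)
```

Measuring the rate at which the buffer actually empties, rather than assuming a nominal link rate, lets the source be throttled to what the destination can sustain.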
The method may further involve monitoring (756) the receive buffer's occupancy (b) for when the receive buffer has drained sufficiently so that the receive buffer can be emptied in a round trip delay interval (2τlat); and after the receive buffer has drained sufficiently, instructing (762) the second channel extender to resume transmission to the first channel extender.
The step of monitoring the receive buffer's occupancy (b) for when the receive buffer has drained sufficiently so that the receive buffer can be emptied in a round trip delay interval may involve comparing the receive buffer's occupancy (b) to 2τlat R′0; wherein 2τlat is a round-trip latency between the first and second channel extenders; and R′0 is the second transmission rate.
If (
The present disclosure further supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to perform the methods described herein.
The present disclosure further supports a system, such as a channel extender, configured to perform the methods described herein.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art will recognize that many changes may be made thereto without departing from the spirit and scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.
Number | Name | Date | Kind |
---|---|---|---|
5991266 | Zheng | Nov 1999 | A |
6480477 | Treadaway et al. | Nov 2002 | B1 |
6810031 | Hegde et al. | Oct 2004 | B1 |
20010024432 | Zehavi et al. | Sep 2001 | A1 |
20020178336 | Fujimoto et al. | Nov 2002 | A1 |
20030065736 | Pathak et al. | Apr 2003 | A1 |
20030202474 | Kreuzenstein et al. | Oct 2003 | A1 |
20030218981 | Scholten | Nov 2003 | A1 |
20030227874 | Wang | Dec 2003 | A1 |
20050047334 | Paul et al. | Mar 2005 | A1 |
20060159098 | Munson et al. | Jul 2006 | A1 |