An interconnection network consists of a set of nodes connected by channels. Such networks are used to transport data packets between nodes. They are used, for example, in multicomputers, multiprocessors, and network switches and routers. In multicomputers, they carry messages between processing nodes. In multiprocessors, they carry memory requests from processing nodes to memory nodes and responses in the reverse direction. In network switches and routers, they carry packets from input line cards to output line cards. For example, published International application PCT/US98/16762 (WO99/11033), by William J. Dally, Philip P. Carvey, Larry R. Dennison and P. Allen King, and entitled “Router With Virtual Channel Allocation,” the entire teachings of which are incorporated herein by reference, describes the use of a three-dimensional torus interconnection network to provide the switching fabric for an internet router.
A key issue in the design of interconnection networks is the management of the buffer storage in the nodes or fabric routers that make up the interconnection fabric. Many prior routers, including that described in the noted pending PCT patent application, manage these buffers using virtual-channel flow control as described in "Virtual Channel Flow Control," by William J. Dally, IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 2, March 1992, pp. 194-205. With this method, the buffer space associated with each channel is partitioned and each partition is associated with a virtual channel. Each packet traversing the channel is assigned to a particular virtual channel and does not compete for buffer space with packets traversing other virtual channels.
Recently, routers with hundreds of virtual channels have been constructed to provide isolation between different classes of traffic directed to different destinations. With these routers, the amount of space required to hold the virtual channel buffers becomes an issue. The number of buffers, N, must be large to provide isolation between many traffic classes. At the same time, the size of each buffer, S, must be large to provide good throughput on a single virtual channel. The total input buffer size is the product of these two terms, T=N×S. For a router with N=512 virtual channels and S=4 flits, each of 576 bits, the total buffer space required is 2048 flits, or 1,179,648 bits. The buffers required for the seven input channels of a router of the type described in the above patent application take more than 7 Mbits of storage.
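For concreteness, the sizing arithmetic may be restated as a worked computation (the final ≈7.9 Mbit figure is our arithmetic on the numbers above):

```latex
\begin{aligned}
T &= N \times S = 512 \times 4 = 2048 \text{ flits per input channel},\\
2048 \times 576 &= 1{,}179{,}648 \text{ bits per input channel},\\
7 \times 1{,}179{,}648 &= 8{,}257{,}536 \text{ bits} \approx 7.9 \text{ Mbits for seven input channels}.
\end{aligned}
```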
These storage requirements make it infeasible to implement single-chip fabric routers with large numbers of large virtual channel buffers in present VLSI technology, which is limited to about 1 Mbit per router ASIC chip. In the past this has been addressed either by providing a smaller number of virtual channels, which can lead to buffer interference between different traffic classes; by making each virtual channel buffer small (often one flit in size), which leads to poor performance on a single virtual channel; or by dividing the router across several ASIC chips, which increases cost and complexity.
The present invention overcomes the storage limitations of prior-art routers by providing a small pool of on-chip flit-buffers that are used as a cache and overflowing any flits that do not fit into this pool to off-chip storage. Our simulation studies show that while buffer isolation is required to guarantee traffic isolation, in practice only a tiny fraction of the buffers are typically occupied. Thus, most of the time all active virtual channels fit in the cache and the external memory is rarely accessed.
Thus, in accordance with the present invention, a router includes buffers for information units such as flits transferred through the router. The buffers include a first set of rapidly accessible buffers for the information units and a second set of buffers for the information units that are accessed more slowly than the first set.
In the preferred embodiment, the fabric router is implemented on one or more integrated circuit chips. The first set of buffers is located on the router integrated circuit chips, and the second set of buffers is located on memory chips separate from the router integrated circuit chips. The second set of buffers may hold information units for a complete set of virtual channels.
In one embodiment, the first set of buffers comprises a buffer pool and a pointer array. The buffer pool is shared by the virtual channels, and the pointer array points to the information units of individual virtual channels within the buffer pool.
In another embodiment, the first set of buffers is organized as a set associative cache. Specifically, each entry in the set associative cache may contain a single information unit or it may contain the buffers and state for an entire virtual channel.
Flow control may be provided to stop the arrival of new information units while transferring information units between the first set of buffers and the second set of buffers. The flow control may be blocking or credit based.
Miss status registers may hold the information units waiting for access to the second set of buffers. An eviction buffer may hold entries staged for transfer from the first set of buffers to the second set of buffers.
Applications of the invention include a multicomputer interconnection network, a network switch or router, and a fabric router within an internet router.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Although the present invention is applicable to any router application, including those in multicomputers, multiprocessors, and network switches and routers, it will be described relative to a fabric router within an internet router. Such routers are presented in the above-mentioned PCT application.
As illustrated in the accompanying drawings, the network is made up of links and routers. In the network backbone, the links are usually fiber optic communication channels operating using the SONET (synchronous optical network) protocol. SONET links operate at a variety of data rates ranging from OC-3 (155 Mb/s) to OC-192 (9.9 Gb/s). These links, sometimes called trunks, move data from one point to another, often over considerable distances.
Routers connect a group of links together and perform two functions: forwarding and routing. A data packet arriving on one link of a router is forwarded by sending it out on a different link depending on its eventual destination and the state of the output links. To compute the output link for a given packet, the router participates in a routing protocol where all of the routers on the Internet exchange information about the connectivity of the network and compute routing tables based on this information.
Most prior art Internet routers are based on a common bus or a crossbar switch. Typically, a given SONET link 30 is connected to a line-interface module. This module extracts the packets from the incoming SONET stream. For each incoming packet, the line interface reads the packet header, and using this information, determines the output port (or ports) to which the packet is to be forwarded. To forward the packet, a communication path within the router is arbitrated, and the packet is transmitted to an output line interface module. The module subsequently transmits the packet on an outgoing SONET link to the next hop on the route to its destination.
The router of the above-mentioned international application overcomes the bandwidth and scalability limitations of prior-art bus- and crossbar-based routers by using a multi-hop interconnection network as the switching fabric, in particular the 3-dimensional torus network illustrated in the accompanying drawings.
In the 3-dimensional torus switching fabric of nodes illustrated in the drawings, each node is connected by channels to its neighbors in each of the three dimensions, with the ends of each dimension wrapped around to form a torus.
Typical packets forwarded through the internet range from 50 bytes to 1.5 Kbytes. For transfer through the fabric network of the internet router of the present invention, the packets are divided into segments, or flits, each of 36 bytes. At least the header included in the first flit of a packet is modified for control of data transfer through the fabric of the router. In the preferred router, the data is transferred through the fabric in accordance with a wormhole routing protocol.
Flits of a packet flow through the fabric in a virtual network comprising a set of buffers. One or more buffers for each virtual network are provided on each node in the fabric. Each buffer is sized to hold at least one flow-control digit or flit of a message. The virtual networks all share the single set of physical channels between the nodes of the real fabric network, and a fair arbitration policy is used to multiplex the use of the physical channels over the competing virtual networks.
A fabric router used to forward a packet over the switch fabric, from the module associated with its input link to the module associated with its output link, is illustrated in the drawings.
If a virtual channel of a fabric router destined for an output node is free when the head flit of a packet arrives for that virtual channel, the channel is assigned to that packet for the duration of the packet, that is, until the tail flit of the packet passes. However, multiple packets may be received at a router for the same virtual channel through multiple inputs. If a virtual channel is already assigned, the new head flit must wait in its flit buffer. If the channel is not assigned, but two head flits for that channel arrive together, a fair arbitration must take place. Until selected in the fair arbitration process, flits remain in the input buffer, backpressure being applied upstream.
Once assigned an output virtual channel, a flit is not enabled for transfer across a link until a signal is received from the downstream node that an input buffer at that node is available for the virtual channel.
Prior routers have used a buffer organization, illustrated in the drawings, in which a buffer array 200 statically dedicates a row of flit storage, together with first (F) and last (L) index fields, to each virtual channel.
When flit 100 arrives at the input channel, it is stored in the flit buffer at a location determined by the virtual channel identifier (VCID) field 101 of the flit and the L-field of the selected row of the flit buffer. First, the VCID is used to address the buffer array 200 to select the row 210 associated with this virtual channel. The L-field of this row points to the last flit placed in the buffer; it is incremented to identify the next open flit buffer, into which the arriving flit is stored. When a particular virtual channel of the input channel is selected to output a flit, the flit buffer to be read is selected in a similar manner, using the VCID of the virtual channel and the F field of the corresponding row of the buffer array.
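A minimal C sketch of this conventional per-channel organization; the type and function names (flit_t, vc_enqueue, vc_dequeue) are illustrative rather than from the source, and the F and L fields act as the head and tail indices of a small circular queue:

```c
#include <stdint.h>

#define NUM_VCS   512   /* N: virtual channels per input port */
#define VC_DEPTH    4   /* S: flit buffers per virtual channel */

typedef struct { uint8_t bytes[72]; } flit_t;   /* one 576-bit flit */

/* One row of buffer array 200: dedicated flit storage plus F/L fields. */
typedef struct {
    flit_t  buf[VC_DEPTH];
    uint8_t f;      /* F: index of the first (oldest) flit  */
    uint8_t l;      /* L: index of the last flit placed     */
    uint8_t count;  /* occupancy, for full/empty tests      */
} vc_row_t;

static vc_row_t buffer_array[NUM_VCS];

void vc_init(void) {
    for (int v = 0; v < NUM_VCS; v++)
        buffer_array[v].l = VC_DEPTH - 1;   /* first increment yields slot 0 */
}

/* Store an arriving flit in the row selected by its VCID:
   increment L, then store, as described in the text. */
int vc_enqueue(uint16_t vcid, const flit_t *flit) {
    vc_row_t *row = &buffer_array[vcid];
    if (row->count == VC_DEPTH)
        return -1;                          /* full: apply backpressure upstream */
    row->l = (row->l + 1) % VC_DEPTH;
    row->buf[row->l] = *flit;
    row->count++;
    return 0;
}

/* Read the next flit of a virtual channel selected for output,
   using the F field of the corresponding row. */
int vc_dequeue(uint16_t vcid, flit_t *out) {
    vc_row_t *row = &buffer_array[vcid];
    if (row->count == 0)
        return -1;
    *out = row->buf[row->f];
    row->f = (row->f + 1) % VC_DEPTH;
    row->count--;
    return 0;
}
```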
Simulations have shown that, for typical traffic, only a small fraction of the virtual channels of a particular input port are occupied at a given time. We can exploit this behavior by providing only a small pool of buffers on chip, with the full array of buffers in slower, but less expensive, off-chip storage. The buffers in the pool are dynamically assigned to virtual channels so that at any instant in time these buffers hold flits for the fraction of virtual channels that are in use. The buffer pool is, in effect, a cache for the full array of flit buffers in off-chip storage. A simple router ASIC with an inexpensive external DRAM memory can thus simultaneously support large numbers of virtual channels and large buffers for each virtual channel. In a preferred embodiment there are V=2560 virtual channels, each with S=4 flit buffers, with each flit occupying F=576 bits of storage.
A first preferred embodiment of the present invention is illustrated in the drawings. In this embodiment, an on-chip pointer array 300 and a shared on-chip buffer pool 400 serve as a cache for the full flit-buffer array 200, which is held in off-chip memory.
A more detailed view of pointer array 300 and buffer pool 400 is shown in the drawings. The pointer array contains a row for each virtual channel, and each row contains a pointer field for each of the S flit buffers of that channel.
Each pointer field, if in use, specifies the location of the corresponding flit. If the value of the pointer field, P, is in the range [0,B-1], where B is the number of buffers in the buffer pool, then the flit is located in buffer P of buffer pool 400. On the other hand, if P=B, the pointer indicates that the corresponding flit is located in the off-chip flit-buffer array. Flits in the off-chip flit-buffer array are located by VCID and flit number, not by pointer.
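The pointer encoding might be sketched as follows, continuing the C sketch above; POOL_SIZE (B) is an illustrative value, and offchip_read is a hypothetical helper standing in for the off-chip access path:

```c
#define V_CHANNELS  2560        /* V in the first preferred embodiment        */
#define POOL_SIZE    256        /* B: on-chip pool entries (illustrative)     */
#define P_OFFCHIP   POOL_SIZE   /* P == B marks a flit held off-chip          */

extern const flit_t *offchip_read(uint16_t vcid, uint8_t b);  /* hypothetical */

typedef struct {
    uint16_t p[VC_DEPTH];       /* one pointer field per flit buffer slot */
    uint8_t  f, l;              /* F and L fields, as before              */
} ptr_row_t;

static ptr_row_t pointer_array[V_CHANNELS];  /* pointer array 300      */
static flit_t    buffer_pool[POOL_SIZE];     /* shared buffer pool 400 */

/* Locate flit number b of virtual channel vcid. */
const flit_t *locate_flit(uint16_t vcid, uint8_t b) {
    uint16_t p = pointer_array[vcid].p[b];
    if (p < POOL_SIZE)
        return &buffer_pool[p];   /* flit resides in the on-chip pool        */
    /* P == B: the flit is in the off-chip array, found by {VCID, b}. */
    return offchip_read(vcid, b);
}
```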
The use of this structure is illustrated in the example shown in the drawings.
To see the advantage of flit-buffer caching, consider our example system with V=2560 virtual channels, each with S=4 flit buffers holding F=576-bit flits. A conventional buffer organization, as described above, would require V×S×F = 2560×4×576 = 5,898,240 bits of on-chip storage, well beyond the roughly 1 Mbit available on a router ASIC chip. With flit-buffer caching, only the pointer array and the buffer pool, 241,600 bits in total for this example, need reside on chip; the full flit-buffer array is held in inexpensive off-chip memory.
The input channel controller uses a dual-threshold buffer management algorithm to ensure an adequate supply of buffers in the pool. Whenever the number of free buffers in the pool falls below a threshold, e.g., 32, the channel controller begins evicting flits from the buffer pool 400 to the off-chip buffer array 200 and updating pointer array 300 to indicate the change. Once flit eviction begins, it continues until the number of free buffers in the pool exceeds a second threshold, e.g., 64. During eviction, the flits to be evicted can be selected using any algorithm: random, first available, or least recently used. While an LRU algorithm gives slightly higher performance, the eviction event is so rare that the simplest possible algorithm suffices. The eviction process is necessary to keep the upstream controller from waiting indefinitely for a flit to depart from the buffer. Without eviction, this waiting would create dependencies between unrelated virtual channels, possibly leading to tree saturation or deadlock.
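A sketch of this dual-threshold (hysteresis) policy under the thresholds named above; free_pool_buffers and evict_one_flit are hypothetical helpers standing in for the pool bookkeeping and the copy to off-chip array 200 with its pointer-array update:

```c
#define EVICT_LOW   32   /* begin evicting when free pool buffers fall below this */
#define EVICT_HIGH  64   /* stop once the free count exceeds this level           */

extern int  free_pool_buffers(void);   /* hypothetical: current free-buffer count */
extern void evict_one_flit(void);      /* hypothetical: move one flit to off-chip
                                          array 200, update pointer array 300,
                                          and free its pool entry                 */

/* Run by the input channel controller after each pool allocation. */
void manage_pool(void) {
    if (free_pool_buffers() < EVICT_LOW) {
        /* Any victim-selection algorithm works here; eviction is rare
           enough that random or first-available suffices. */
        while (free_pool_buffers() <= EVICT_HIGH)
            evict_one_flit();
    }
}
```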
Prior routers, such as the one described in the above-mentioned pending PCT application, employ credit-based flow control, in which the upstream controller keeps a credit count 73 for each virtual channel reflecting the number of free flit buffers available for that channel on the downstream node; a flit may be forwarded on a virtual channel only when its credit count is non-zero.
With flit-buffer caching, the flow-control strategy of the upstream output channel controller must be modified, because the off-chip buffer array has much lower bandwidth than the on-chip buffer pool or the channel itself, and this bandwidth must not be oversubscribed in the unlikely event of a buffer-pool overflow. Once the pool becomes full, flits must be evicted to the off-chip buffer array, and transfers from all VCs must be blocked until space is available in the pool. To handle flit-buffer caching, the prior credit-based flow control mechanism is augmented with a buffer-pool credit count 75 that reflects the number of empty buffers in the downstream buffer pool. The upstream controller must have both a non-zero credit count 73 for the virtual channel and a non-zero buffer-pool credit count 75 before it can forward a flit downstream. This ensures that there is space in the buffer pool for all arriving flits. Initially, maximum counts are set for all virtual channels and for the buffer pool, which is shared by all VCs. Each time the upstream controller forwards a flit, it decrements both the credit count for the corresponding VC and the shared buffer-pool credit count. When the downstream controller sends a credit upstream for any VC, it sets a pool bit in the credit if the flit being credited was sent from the buffer pool. When the upstream controller receives a credit with the pool bit set, it increments the buffer-pool credit count as well as the VC credit count. When a flit is evicted from the buffer pool to the off-chip buffer array, a special pool-only credit is sent to the upstream controller to update its buffer-pool credit count to reflect the change. Thus, the transfer of new flits is stopped only while flits are being transferred between the buffer pool and the off-chip buffer array.
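The augmented credit scheme might be sketched as follows; the struct and function names are illustrative, with vc_credits playing the role of credit count 73 and pool_credits the role of buffer-pool credit count 75:

```c
#include <stdint.h>

#define V_CHANNELS 2560

typedef struct {
    int vc_credits[V_CHANNELS];  /* per-VC credit count 73              */
    int pool_credits;            /* shared buffer-pool credit count 75  */
} upstream_t;

/* A flit may be forwarded only when both counts are non-zero. */
int can_send(const upstream_t *u, uint16_t vcid) {
    return u->vc_credits[vcid] > 0 && u->pool_credits > 0;
}

void on_flit_sent(upstream_t *u, uint16_t vcid) {
    u->vc_credits[vcid]--;       /* every forwarded flit consumes both */
    u->pool_credits--;           /* a VC credit and a pool credit      */
}

/* Credit arriving from downstream.  pool_bit is set when the credited
   flit was sent from the buffer pool; pool_only marks the special credit
   sent when a flit is evicted from the pool to the off-chip array. */
void on_credit(upstream_t *u, uint16_t vcid, int pool_bit, int pool_only) {
    if (!pool_only)
        u->vc_credits[vcid]++;
    if (pool_bit || pool_only)
        u->pool_credits++;
}
```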
Alternatively, blocking flow control can be employed to prevent pool overrun, rather than credit-based flow control. With this approach, a block bit is set in all upstream credits when the number of empty buffers in the buffer pool falls below a threshold. While this bit is set, the upstream output controller is inhibited from sending any flits downstream. Once the number of empty buffers rises above the threshold, the block bit is cleared and the upstream controller may resume sending flits. Blocking flow control is advantageous because it does not require a pool credit counter in the upstream controller and because the same bit can be used to inhibit flit transmission for other reasons.
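A corresponding sketch of the blocking alternative, reusing upstream_t from above; note that the upstream side needs only the per-VC credit counts plus a single latched block bit:

```c
/* Upstream state for the blocking variant: no pool counter is needed. */
static int blocked;

int can_send_blocking(const upstream_t *u, uint16_t vcid) {
    return !blocked && u->vc_credits[vcid] > 0;
}

/* Every upstream credit carries the downstream node's current block bit,
   set while free space in the buffer pool is below the threshold. */
void on_credit_blocking(upstream_t *u, uint16_t vcid, int block_bit) {
    u->vc_credits[vcid]++;
    blocked = block_bit;
}
```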
An alternative preferred embodiment dispenses with the pointer array and instead uses a set-associative cache organization, as illustrated in the drawings. In this organization, a state array 500 holds the F and L fields for each virtual channel, while the flits themselves are held in a set of cache arrays 600 backed by the off-chip flit array 200.
The allowed location for a particular flit F={V:B} in the cache is determined by the low-order bits of F. For example, consider a case with V=2560 virtual channels, S=4 buffers per virtual channel, and C=128 entries in each of A=2 cache arrays. In this case, the flit identifier F is 14 bits: 12 bits of VCID and 2 bits of buffer identifier, B. The seven-bit cache array index is constructed by appending B to the low-order 5 bits of V, I={V[4:0]:B}. The remaining seven high-order bits of V are then used as the cache tag, T=V[11:5].
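These bit manipulations can be written directly; a sketch assuming the parameters above (12-bit VCID, 2-bit buffer number, 128-entry arrays):

```c
#include <stdint.h>

/* F = {V:B}: a 12-bit VCID and a 2-bit buffer number form the 14-bit
   flit identifier.  With C=128 entries per array, the index is 7 bits. */
static inline uint8_t cache_index(uint16_t vcid, uint8_t b) {
    return (uint8_t)(((vcid & 0x1Fu) << 2) | (b & 0x3u));  /* I = {V[4:0]:B} */
}

static inline uint8_t cache_tag(uint16_t vcid) {
    return (uint8_t)((vcid >> 5) & 0x7Fu);                 /* T = V[11:5]    */
}
```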
When a flit arrives over the channel, its VCID is used to index the state array 500 to read the L field. This field is then incremented to determine the buffer location, B, within the virtual channel into which the flit is to be stored. The array index, I, is then formed by concatenating B with the low 5 bits of the VCID, and this index is used to access the cache arrays 600. While two cache arrays are shown, it is understood that any number of arrays may be employed. One of the cache arrays is then selected to receive this flit using a selection algorithm. Preference may be given to an array that contains an invalid entry in this location or, if all entries are valid, to the array whose entry in this location has been least recently used. If the selected cache array already contains a valid flit, this flit is first read out into the eviction FIFO 800 along with its identity (VCID and B). The arriving flit is then written into the vacated location, the tag of that location is updated with the high bits of the VCID, and the location is marked valid.
When a request comes to read the next flit from a particular virtual channel, the VCID is again used to index the state array 500, and the F field is read to determine the buffer location, B, to be read. The F field is then incremented and written back to the state array. The VCID and buffer number, B, are then used to search for the flit in three locations. First, the index is formed as above, I={V[4:0]:B}, and the cache arrays are accessed. The tag accessed from each array is compared to V[11:5] using comparator 701. If there is a match and the valid bit is set, then the flit has been located and is read out of the corresponding array via tri-state buffer 702. The valid bit of the entry containing the flit is then cleared to free this entry for later use.
If the requested flit is not found in any of the cache arrays, the eviction FIFO is then searched to determine whether it contains a valid entry with matching VCID and B. If a match is found, the flit is read out of the eviction FIFO and the valid bit of the entry is cleared to free the location. Finally, if the flit is not found in either the cache arrays or the eviction FIFO, off-chip flit array 200 is read at location {V[11:0]:B} to retrieve the flit from the backing store.
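Putting the three-level search together, a sketch of the read path, continuing the sketches above (flit_t, cache_index, cache_tag, offchip_read); fifo_take is a hypothetical helper for the associative search of eviction FIFO 800:

```c
#define NUM_WAYS  2      /* A: number of cache arrays  */
#define NUM_SETS  128    /* C: entries per array       */

typedef struct {
    flit_t  flit;
    uint8_t tag;         /* T = V[11:5] */
    uint8_t valid;
} cache_entry_t;

static cache_entry_t cache[NUM_WAYS][NUM_SETS];   /* cache arrays 600 */

extern int fifo_take(uint16_t vcid, uint8_t b, flit_t *out); /* hypothetical:
                        search FIFO 800, pop on match, return 0 on success */

/* Read the next flit of a virtual channel: cache arrays first,
   then the eviction FIFO, then the off-chip backing store. */
void read_flit(uint16_t vcid, uint8_t b, flit_t *out) {
    uint8_t idx = cache_index(vcid, b);
    uint8_t tag = cache_tag(vcid);

    for (int w = 0; w < NUM_WAYS; w++) {          /* 1. cache arrays      */
        cache_entry_t *e = &cache[w][idx];
        if (e->valid && e->tag == tag) {
            *out = e->flit;
            e->valid = 0;                         /* free the entry       */
            return;
        }
    }
    if (fifo_take(vcid, b, out) == 0)             /* 2. eviction FIFO 800 */
        return;
    *out = *offchip_read(vcid, b);                /* 3. off-chip array 200 */
}
```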
As with the first preferred embodiment, flow control must be employed, and eviction performed, to make sure that requests from the upstream controller do not overrun the eviction FIFO. Either credit-based or blocking flow control can be employed, as described above, to prevent the upstream controller from sending flits when free space in the eviction FIFO falls below a threshold. Flits from the eviction FIFO are also written back to the off-chip flit array 200 whenever free space falls below this threshold.
Compared to the first preferred embodiment, the set-associative organization requires fewer on-chip memory bits to implement, but is likely to have a lower hit rate due to conflict misses. For the example numbers above, the state array requires 2560×5=12,800 bits of storage, cache arrays with 256 flit-sized entries require 256×(576+1+7)=149,504 bits, and an eviction FIFO with 16 entries requires 16×(576+14)=9,440 bits, for a total of 171,744 bits, compared to 241,600 bits for the first preferred embodiment.
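Restating the source's arithmetic for the second embodiment:

```latex
\begin{aligned}
\text{state array:}   &\quad 2560 \times 5 = 12{,}800 \text{ bits},\\
\text{cache arrays:}  &\quad 256 \times (576 + 1 + 7) = 149{,}504 \text{ bits},\\
\text{eviction FIFO:} &\quad 16 \times (576 + 14) = 9{,}440 \text{ bits},\\
\text{total:}         &\quad 12{,}800 + 149{,}504 + 9{,}440 = 171{,}744 \text{ bits}.
\end{aligned}
```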
Conflict misses occur because a flit cannot reside in just any flit buffer, but only in a single location of each array (a set of buffers). Thus, an active flit may be evicted from the array before all buffers are full because several other flits map to the same location. However, the effect of these conflict misses is mitigated somewhat by the associative search of the eviction FIFO, which acts as a victim cache (see Jouppi, "Improving Direct Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990, pp. 364-375).
The storage requirements of a flit cache can be reduced further using a third preferred embodiment, illustrated in the drawings, in which each cache entry holds the complete state and all S flit buffers of an entire virtual channel, so that no per-virtual-channel arrays are required on chip.
When a flit arrives at the buffer, the entry for the flit's virtual channel is located and brought on chip if it is not already there. The entry is then updated to insert the new flit and update the L field. Specifically, the VCID field of the arriving flit is used to search for the virtual channel entry in three locations. First, the cache arrays are searched by using the low bits of the VCID, e.g., V[6:0], as an index and the high bits of the VCID, e.g., V[11:7], as a tag. If the stored tag matches the presented tag and the valid bit is set, the entry has been found. In this case, the L field of the matching entry is read and used to select the location within the entry in which to store the arriving flit. The L field is then incremented and written back to the entry.
In the case of a cache miss, that is, no match in the cache arrays, one of the cache arrays is selected to receive the required entry, as described above. If the selected location currently holds a valid entry for a different virtual channel, that entry is evicted to the eviction FIFO 1000. The eviction FIFO is then searched for the required entry. If found, the entry is loaded from the FIFO into the selected cache array and updated as described above. If the entry is not found in the eviction FIFO, it is fetched from the off-chip buffer array. To allow other arriving flits to be processed while waiting for an entry to be loaded from off chip, the pending flit is temporarily stored in a miss holding register 1002 until the off-chip reference is complete. Once the entry has been loaded from off chip, the update proceeds as described above.
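A sketch of the write path for this virtual-channel-grained cache, reusing flit_t from above; lookup_vc_entry, select_victim, fifo_fetch, mshr_park, and offchip_fetch_entry are hypothetical helpers for the mechanisms just described:

```c
typedef struct {
    flit_t  buf[4];      /* S flit buffers for one virtual channel   */
    uint8_t f, l;        /* per-channel F and L state                */
    uint8_t tag;         /* high bits of the VCID, e.g. V[11:7]      */
    uint8_t valid;
} vc_entry_t;

extern vc_entry_t *lookup_vc_entry(uint16_t vcid);   /* tag-match the arrays  */
extern vc_entry_t *select_victim(uint16_t vcid);     /* choose a way, spilling
                                                        any valid occupant to
                                                        eviction FIFO 1000    */
extern int  fifo_fetch(uint16_t vcid, vc_entry_t *e);  /* reload from the
                                                        eviction FIFO, 1 on hit */
extern void mshr_park(uint16_t vcid, const flit_t *f); /* miss holding reg 1002 */
extern void offchip_fetch_entry(uint16_t vcid, vc_entry_t *e); /* DRAM read    */

/* Insert an arriving flit, bringing its virtual-channel entry on chip first. */
void write_flit(uint16_t vcid, const flit_t *flit) {
    vc_entry_t *e = lookup_vc_entry(vcid);     /* 1. search the cache arrays   */
    if (!e) {
        e = select_victim(vcid);               /*    miss: free a way          */
        if (!fifo_fetch(vcid, e)) {            /* 2. search the eviction FIFO  */
            mshr_park(vcid, flit);             /* 3. park the flit and fetch   */
            offchip_fetch_entry(vcid, e);      /*    the entry from off chip;  */
            return;                            /*    update resumes on fill    */
        }
    }
    e->l = (e->l + 1) % 4;                     /* increment L, then store      */
    e->buf[e->l] = *flit;
    /* The read path proceeds symmetrically using the F field. */
}
```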
To read a flit out of the cache given a VCID, the search proceeds in a manner similar to the write. The virtual channel entry is loaded into the cache, from the eviction FIFO or off-chip buffer array, if it is not already there. The F field of the entry is used to select the flit within the entry for readout. Finally, the F field is incremented and written back to the entry.
As with the other preferred embodiments, flow control, either blocking or credit-based, is required to stop the upstream controller from sending flits when the number of empty locations in either the eviction FIFO or the miss holding registers falls below a threshold value.
The advantage of the third preferred embodiment is the small size of its on-chip storage arrays when there are very large numbers of virtual channels. Because there are no per-virtual-channel on-chip arrays, the size is largely independent of the number of virtual channels (only the size of the tag field grows logarithmically with the number of virtual channels). For our example parameters, V=2560, S=4, and F=576, an on-chip array with 64 entries (256 flits) contains 64×((S×F)+12)=148,224 bits. An eviction FIFO and a miss holding register array with 16 entries each add an additional 37,152 and 9,328 bits, respectively. The total amount of on-chip storage, 194,704 bits, is slightly larger than the 171,744 bits of the second preferred embodiment, but remains essentially constant as the number of virtual channels is increased beyond 2560.
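Restating the source's arithmetic for the third embodiment:

```latex
\begin{aligned}
\text{cache arrays:}                         &\quad 64 \times (4 \times 576 + 12) = 148{,}224 \text{ bits},\\
\text{eviction FIFO (16 entries):}           &\quad 37{,}152 \text{ bits},\\
\text{miss holding registers (16 entries):}  &\quad 9{,}328 \text{ bits},\\
\text{total:}                                &\quad 194{,}704 \text{ bits}.
\end{aligned}
```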
While we have described particular arrangements of the flit cache for the three preferred embodiments, one skilled in the art of fabric router design will understand that many alternative arrangements and organizations are possible. For example, while we have described pointer-based and set-associative organizations, one could also employ a fully-associative organization (particularly for small cache sizes), a hash table, or a tree-structured cache. While we have described cache block sizes of one flit and one virtual channel, other sizes are possible. Also, while we have described a particular encoding of virtual-channel state with the fields F, L, and E, many other encodings are possible. Moreover, while we have cached only the contents of flits and the input virtual channel state, the caching could be extended to cache the output port associated with a virtual channel and the output virtual channel state.
While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. For example, though the preferred embodiments provide flit buffers in fabric routers, the invention can be extended to other information units, such as packets and messages, in other routers.
This application is a continuation of U.S. application Ser. No. 09/316,699, filed May 21, 1999. The entire teachings of the above application are incorporated herein by reference.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 09/316,699 | May 1999 | US |
| Child | 10/926,122 | Aug. 2004 | US |