1. Field of the Invention
This invention relates to the field of packet communications in electronic systems, such as computer systems, and to routing packet traffic in such systems.
2. Description of the Related Art
Systems that implement packet communications (as opposed to shared bus communications) often implement point-to-point interconnect between nodes in the system. For convenience, each such communication path between nodes will be referred to herein as a link. A packet can be transmitted on a link, and a link can be one way or two way.
In a multinode system, each node typically includes circuitry to interface to multiple other nodes. For example, 3 or 4 links can be supported from a given node to connect to other nodes. However, if fewer than the maximum number of nodes is included in a given system, then links on a particular node can be idle, and bandwidth that could otherwise be used to communicate in the system is wasted.
One packet-based link interconnect is specified in the HyperTransport™ (HT) specification for I/O interconnect. A corresponding coherent HT (cHT) specification also exists. Packets on HT and cHT travel in different virtual channels to provide deadlock free operation. Specifically, posted request, non-posted request, and response virtual channels are provided on HT, and cHT includes those virtual channels and the probe virtual channel. Routing of packets can be based on virtual channel according to the HT specification, and thus different packets to the same node but in different virtual channels can be routed on different links. If those links are all coupled to the same other node, some of the wasted bandwidth can be reclaimed.
Unfortunately, the use of multiple links for different virtual channels does not lead to even use of bandwidth on the links. Responses are more frequent than requests (e.g. several occur per coherent request). Frequently, responses include data, since the responses to read requests (the most frequent requests) carry the data. For block-sized responses, the data is significantly larger than the non-data-carrying responses, requests, and probes. Additionally, the packets transmitted in a given virtual channel may be bursty, and thus bandwidth on other links goes unused while the bursty channel's traffic travels over one link.
In one embodiment, a node comprises a plurality of interface circuits coupled to a node controller. Each of the plurality of interface circuits is configured to couple to a respective link of a plurality of links. The node controller is configured to select a first link from two or more of the plurality of links to transmit a first packet, wherein the first link is selected responsive to a relative amount of traffic transmitted via each of the two or more of the plurality of links. A system comprising two or more of the nodes is also contemplated.
In an embodiment, a method comprises receiving a first packet in a node controller within a node that is configured to couple to a plurality of links; and selecting a link from two or more of the plurality of links to transmit the first packet, wherein the selecting is responsive to a relative amount of traffic transmitted via each of the two or more of the plurality of links.
In another embodiment, a node comprises a plurality of interface circuits coupled to a node controller. Each of the plurality of interface circuits is configured to couple to a respective link of a plurality of links. The node controller comprises a routing table programmed to select among the plurality of links to transmit each packet of a plurality of packets, wherein the routing table is programmed to select among the plurality of links responsive to one or more packet attributes of each packet. The node controller is further configured to select a first link from two or more of the plurality of links for at least a first packet of the plurality of packets. The node controller is configured to transmit the first packet using the first link instead of a second link indicated by the routing table for the first packet.
The following detailed description makes reference to the accompanying drawings, which are now briefly described.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
Turning now to
Processing nodes 312A-312D implement a packet-based interface for inter-processing node communication. In the present embodiment, the interface is implemented as sets of unidirectional links (e.g. links 324A are used to transmit packets from processing node 312A to processing node 312B and links 324B are used to transmit packets from processing node 312B to processing node 312A). Other sets of links 324C-324H are used to transmit packets between other processing nodes as illustrated in
Generally, the packets may be transmitted as one or more bit times on the links 324 between nodes. A given bit time may be referenced to the rising or falling edge of the clock signal on the corresponding clock lines. That is, both the rising and the falling edges may be used to transfer data, so that the data rate is double the clock frequency (double data rate, or DDR). The packets may include request packets for initiating transactions, probe packets for maintaining cache coherency, and response packets for responding to probes and requests (and for indicating completion by the source/target of a transaction). Some packets may indicate data movement, and the data being moved may be included in the data movement packets. For example, write requests include data. Probe responses with modified data and read responses both include data. Thus, in general, a packet may include a command portion defining the packet, its source and destination, etc. A packet may optionally include a data portion following the command portion. The data may be a cache block in size, for coherent cacheable operations, or may be smaller (e.g. for non-cacheable reads/writes). A block may be the unit of data for which coherence is maintained. That is, the block of data is treated as a unit for coherence purposes. Coherence state is maintained for the unit as a whole (and thus, if a byte is written in the block, then the entire block is considered modified, for example). A block may be a cache block, which is the unit of allocation or deallocation in the caches, or may differ in size from a cache block.
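For illustration only, a packet with a command portion and an optional data portion may be sketched as the following C structure; the field names, field widths, and the assumed 64-byte block size are hypothetical and are not taken from the HT specification:

    /* Illustrative sketch only: field names, widths, and the 64-byte block
     * size are assumptions for discussion, not HT specification definitions. */
    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_BYTES 64u                /* assumed coherence block size */

    typedef struct {
        uint8_t  command;                  /* request, probe, response, ... */
        uint8_t  virtual_channel;
        uint8_t  source_node;
        uint8_t  dest_node;
        uint64_t address;                  /* meaningful for request packets */
        bool     has_data;                 /* a data portion follows the command */
        uint16_t data_len;                 /* block-sized or smaller */
        uint8_t  data[BLOCK_BYTES];        /* data portion, if any */
    } packet_t;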
Processing nodes 312A-312D, in addition to a memory controller and interface logic, may include one or more processors. Broadly speaking, a processing node comprises at least one processor and may optionally include a memory controller for communicating with a memory and other logic as desired. One or more processors may comprise a chip multiprocessing (CMP) or chip multithreaded (CMT) integrated circuit in the processing node or forming the processing node, or the processing node may have any other desired internal structure. Any level of integration or any number of discrete components may form a node. Other types of nodes may include any desired circuitry and the circuitry for communicating on the links. For example, the I/O devices 320A-320B may be I/O nodes, in one embodiment. Generally, a node may be treated as a unit for coherence purposes. Thus, the coherence state in the coherence scheme may be maintained on a per-node basis. Within the node, the location of a given coherent copy of the block may be maintained in any desired fashion, and there may be more than one copy of the block (e.g. in multiple cache levels within the node).
Memories 314A-314D may comprise any suitable memory devices. For example, a memory 314A-314D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), DDR SDRAM, static RAM, etc. The address space of computer system 300 is divided among memories 314A-314D. Each processing node 312A-312D may include a memory map used to determine which addresses are mapped to which memories 314A-314D, and hence to which processing node 312A-312D a memory request for a particular address should be routed. In one embodiment, the coherency point for an address within computer system 300 is the memory controller 316A-316D coupled to the memory storing bytes corresponding to the address. In other words, the memory controller 316A-316D is responsible for ensuring that each memory access to the corresponding memory 314A-314D occurs in a cache coherent fashion. Memory controllers 316A-316D may comprise control circuitry for interfacing to memories 314A-314D. Additionally, memory controllers 316A-316D may include request queues for queuing memory requests.
Generally, interface circuits 318A-318L may comprise a variety of buffers for receiving packets from the link and for buffering packets to be transmitted upon the link. Computer system 300 may employ any suitable flow control mechanism for transmitting packets. For example, in one embodiment, each interface circuit 318 stores a count of the number of each type of buffer within the receiver at the other end of the link to which that interface logic is connected. The interface logic does not transmit a packet unless the receiving interface logic has a free buffer to store the packet. As a receiving buffer is freed by routing a packet onward, the receiving interface logic transmits a message to the sending interface logic to indicate that the buffer has been freed. Such a mechanism may be referred to as a “coupon-based” system.
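A minimal sketch of such a coupon-based sender, assuming per-buffer-type credit counters and hypothetical names (the buffer types shown are only examples), is:

    /* Hypothetical coupon-based flow control: the sender tracks free receiver
     * buffers per buffer type, consumes one per packet sent, and replenishes
     * when the receiver signals that a buffer has been freed. */
    #include <stdbool.h>

    enum buf_type { BUF_POSTED, BUF_NONPOSTED, BUF_RESPONSE, BUF_PROBE,
                    NUM_BUF_TYPES };

    typedef struct {
        unsigned free_bufs[NUM_BUF_TYPES];   /* free buffers at the receiver */
    } link_credits_t;

    /* Returns true and consumes a credit if the packet may be transmitted. */
    static bool try_send(link_credits_t *c, enum buf_type t)
    {
        if (c->free_bufs[t] == 0)
            return false;            /* no free receiver buffer: hold the packet */
        c->free_bufs[t]--;           /* spend one "coupon" */
        return true;
    }

    /* Called when the receiver indicates a buffer of type t has been freed. */
    static void buffer_released(link_credits_t *c, enum buf_type t)
    {
        c->free_bufs[t]++;
    }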
I/O devices 320A-320B may be any suitable I/O devices. For example, I/O devices 320A-320B may include devices for communicating with another computer system to which the devices may be coupled (e.g. network interface cards or modems). Furthermore, I/O devices 320A-320B may include video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards, sound cards, and a variety of data acquisition cards such as GPIB or field bus interface cards. Furthermore, any I/O device implemented as a card may also be implemented as circuitry on the main circuit board of the system 300 and/or software executed on a processing node. It is noted that the term “I/O device” and the term “peripheral device” are intended to be synonymous herein.
In one embodiment, the links 324A-324H are compatible with the HyperTransport™ (HT) specification promulgated by the HT consortium, specifically version 3. The protocol on the links is modified from the HT specification to support coherency on the links, as described above. For the remainder of this discussion, HT links will be used as an example (and the interface circuits 318A-318L may be referred to as HT ports). However, other embodiments may implement any links and any protocol thereon. Additionally, processing nodes may be used as an example of nodes participating in the cache coherence scheme (coherent nodes). However, any coherent nodes may be used in other embodiments.
The nodes 10A-10D may implement a packet distribution mechanism to more evenly consume the available bandwidth on two or more links between the same two nodes. Each node may be configured to select a link (and thus an interface circuit that couples to that link) on which to transmit a packet. If the packet may be transmitted on one of two or more links, the node may select a link dependent on the relative amount of traffic that has been transmitted on each of the corresponding links. Traffic may be measured in any desired fashion (e.g. numbers of packets transmitted, number of bytes transmitted, or any other desired measurement). By incorporating the amount of traffic that has been transmitted on each eligible link, the nodes 10A-10D may be more likely to evenly use the available bandwidth on multiple links to the same other node (or close to evenly use the bandwidth). The packet distribution may be independent of virtual channel, packet type, etc. and thus any packets that are enabled for distribution may be distributed over the available links.
Each interface circuit is configured to couple to a different link. Accordingly, the selection of a “link” implies selecting an interface circuit to which the packet is routed. The interface circuit may drive the packet on the link during use. For convenience, the discussion below may refer to selecting a link, which may be effectively synonymous with selecting an interface circuit that is coupled to that link during use.
In one embodiment, the nodes 10A-10D may implement a routing table that may use one or more packet attributes to identify the link on which the packet should be transmitted. Packet attributes may include any identifiable property or value related to the packet. For example, request packets may include an address of data to be accessed in the request. The address may be a packet attribute, and may be decoded to determine a link (e.g. a link that may result in the packet being routed to the home node for the address). For response packets, the source node may be a packet attribute. Other packet attributes may include virtual channel, destination node, etc. The packet attributes may be used to index the routing table and output a link identifier indicating a link on which the packet is to be routed. If packet distribution is implemented, the link selected according to the packet distribution mechanism may be used instead of the link identified by the routing table. The link identified by the routing table may be one of the links over which packet traffic is being distributed, and thus in some instances the output of the routing table and the packet distribution mechanism may be the same. Viewed in another way, two packets having the same packet attributes that are used to index the routing table 48 may be routed onto different links.
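As one possible sketch, assuming the routing table is indexed by destination node and virtual channel (other packet attributes could equally form the index), the table lookup and the distribution override might be expressed as:

    /* Sketch only: the index formation, table dimensions, and names here are
     * assumptions; an actual routing table 48 may be indexed differently. */
    #include <stdbool.h>

    #define NUM_NODES 8
    #define NUM_VCS   4

    typedef unsigned link_id_t;

    static link_id_t routing_table[NUM_NODES][NUM_VCS];   /* programmable */

    static link_id_t route_packet(unsigned dest_node, unsigned vc,
                                  bool distribute, link_id_t distributed_link)
    {
        link_id_t table_link = routing_table[dest_node][vc];
        /* When packet distribution applies, its selection overrides the table
         * output; in some instances the two may still name the same link. */
        return distribute ? distributed_link : table_link;
    }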
In one embodiment, the output of the routing table may be used for certain destination nodes and the packet distribution mechanism may be used for other destination nodes of packets. That is, nodes to which a given node is connected via two or more links may use the packet distribution mechanism and others may not. In another embodiment, certain links may be grouped together and the packet distribution mechanism may be used for those links.
Turning now to
The node controller 42 may generally be configured to receive packets from the processor cores 40A-40N, the memory controller 316A, and the HT ports 12A-12D and to route those communications to the processor cores 40A-40N, the HT ports 12A-12D, and the memory controller 316A dependent upon the packet type, the address in the packet, etc. The node controller 42 may write received packet-identifying data into the SRQ 44 (from any source). The node controller 42 (and more particularly the scheduler control unit 46) may schedule packets from the SRQ for routing to the destination or destinations among the processor cores 40A-40N, the HT ports 12A-12D, and the memory controller 316A. The processor cores 40A-40N and the memory controller 316A are local packet sources, which may generate new packets to be routed. The processor cores 40A-40N may generate requests, and may provide responses to received probes and other received packets. The memory controller 316A may generate responses for requests received by the memory controller 316A from either the HT ports 12A-12D or the processor cores 40A-40N, probes to maintain coherence, etc. The HT ports 12A-12D are external packet sources. Packets received from the HT ports 12A-12D may be passing through the node 10A (and thus may be routed out through another HT port 12A-12D) or may be targeted at a processor core 40A-40N and/or the memory controller 316A.
In this embodiment, the local packet sources may have an output link assigned by the distribution control unit 52 as the packets are input to the SRQ 44. The output link may be the link indicated by the routing table 48 for the packet (based on one or more packet attributes), or may be one of two or more links over which traffic is being distributed. In this embodiment, packet traffic may be distributed over two or more links for a particular destination node (or nodes) for locally generated packets from local packet sources. Packets from external sources may be routed based on the routing table output (e.g. as checked by the scheduler control unit 46). Distributing only locally generated packet traffic is one embodiment; other embodiments may distribute external packet traffic as well. Distributing only locally generated packet traffic may optimize for two-node systems. However, other embodiments may implement distribution of only locally generated packets in multinode systems as well.
For locally generated packets, the distribution control unit 52 may obtain a link assignment from the routing table 48 and may also check the destination node of the packet against the distribution control register 50. If the destination node matches a node listed in the distribution control register 50, the distribution control register 50 may also indicate which of the links are included in the subset of links over which packet traffic is being distributed. The distribution control unit 52 may select one of the links dependent on the relative amount of packet traffic that has been transmitted via each link in the subset. Various algorithms may be used for the selection. One is described in more detail with regard to
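A hedged sketch of this write-time check for locally generated packets follows; the structure fields and helper name are hypothetical, and the placeholder selection merely stands in for the traffic-based algorithm sketched later:

    /* Hypothetical write-time assignment: the destination node of a locally
     * generated packet is compared against the distribution control register;
     * on a match, a link is chosen from the programmed subset, otherwise the
     * routing table output is kept. */
    #include <stdbool.h>

    typedef unsigned link_id_t;

    typedef struct {
        bool     enabled;
        unsigned dest_node;    /* node whose packets are distributed */
        unsigned link_mask;    /* bit vector: links in the distribution subset */
    } dist_ctrl_t;

    static link_id_t first_link_in(unsigned mask)   /* placeholder selection */
    {
        link_id_t l = 0;
        while (((mask >> l) & 1u) == 0u)
            l++;
        return l;
    }

    static link_id_t assign_output_link(const dist_ctrl_t *dc,
                                        unsigned dest_node, link_id_t table_link)
    {
        if (dc->enabled && dest_node == dc->dest_node && dc->link_mask != 0u)
            return first_link_in(dc->link_mask);    /* distribute over subset */
        return table_link;                          /* routing table output */
    }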
The routing table 48 may be programmable with link mappings (e.g. via instructions executed on a processor core 40A-40N or another processor core in another node). Similarly, the distribution control register 50 may be programmable with packet distribution control data. One or more distribution control registers 50 may be included in various embodiments. The routing table 48 and/or distribution control registers 50 may be programmed during system initialization, for example. In other embodiments, distribution control data may be provided by blowing fuses, tying pins, etc.
The distribution control unit 52 may be responsible for tracking packet traffic on the links that are identified in the distribution control data, to aid in the selection of a link on which the packet is transmitted. The distribution control unit 52 may update the traffic measurement data as packets are written to the SRQ 44 in this embodiment (or may update as the packets are transmitted to the interface circuits, in other embodiments).
Generally, the processor cores 40A-40N may use the interface to the node controller 42 to communicate with other components of the computer system. In one embodiment, communication on the interfaces between the node controller 42 and the processor cores 40A-40N may be in the form of packets similar to those used on the HT links. In other embodiments, any desired communication may be used (e.g. transactions on a bus interface, packets of a different form, etc.). Similarly, communication between the memory controller 316A and the node controller 42 may be in the form of HT packets.
When the scheduler control unit 46 has determined that a packet is ready to be scheduled, the scheduler control unit 46 may output data identifying the packet to packet buffers at the HT port 12A-12D, the memory controller 316A, and the processor cores 40A-40N. That is, the packets themselves may be stored at the source (or receiving interface circuit) and may be routed to the destination interface circuit/local source directly. Data used for scheduling may be written into the SRQ 44.
Generally, a processor core 40A-40N may comprise circuitry that is designed to execute instructions defined in a given instruction set architecture. That is, the processor core circuitry may be configured to fetch, decode, execute, and store results of the instructions defined in the instruction set architecture. The processor cores 40A-40N may comprise any desired configurations, including superpipelined, superscalar, or combinations thereof. Other configurations may include scalar, pipelined, non-pipelined, etc. Various embodiments may employ out of order speculative execution or in order execution. The processor core may include microcoding for one or more instructions or other functions, in combination with any of the above constructions. Various embodiments may implement a variety of other design features such as caches, translation lookaside buffers (TLBs), etc.
The routing table 48 may comprise any storage that can be indexed by packet attributes and store interface circuit identifiers. The routing table 48 may be a set of registers, a random access memory (RAM), a content addressable memory (CAM), combinations of the previous, etc.
While the embodiment of
Turning now to
The request, response, and probe enable fields permit enabling/disabling the packet distribution mechanism for different packet types. Other embodiments may include a single enable. Request packets include read and write requests to initiate transactions, as well as certain coherency-related requests (such as a change-to-dirty request, used to write a shared block that a node has cached). Response packets include responses to requests (e.g. read responses with data, probe responses, and responses indicating completion of a transaction). Probe packets are issued by the home node to maintain coherency, causing state changes in caching nodes and, optionally, data movement as well if a dirty copy exists (or might exist) and is to be forwarded to the requesting node or home node.
The destination node field identifies a destination node to which packets may be directed, and packets to that destination node are to be handled using the packet distribution mechanism. There may be multiple destination node fields to permit multiple destination nodes to be specified, or the destination node field may be encoded to specify more than one node.
A single destination node field that identifies a single node may be used for a two-node system, for example. Each node may have the other node programmed into the destination node field of its distribution control register 50. Thus, packets directed to the other node may be distributed over the two or more links between the nodes. Packets not directed to the other node (e.g. packets to I/O nodes) may be routed to the interface circuit indicated by the routing table. In larger systems, the single destination node field may identify another node to which two or more links are coupled, and packets to other nodes may be routed via the routing table. Alternatively, in larger systems, more destination node fields may be provided if there is more than one node to which multiple links are connected from the current node.
The destination link field may specify the two or more links (interface circuits) over which the packets are to be distributed. In one embodiment, the destination link field may be a bit vector, with one bit assigned to each link. If the bit is set, the link is included in the subset of links over which packets are distributed. If the bit is clear, the link is not included. Other embodiments may encode the links in different ways. In one particular embodiment, a link can be logically divided into sublinks (e.g. a 16 bit link could be divided into two independent 8 bit links). In such embodiments, distribution may be over the sublinks.
If more than one destination node is supported, then there may be more than one destination link field (e.g. there may be one destination link field for each supported destination node).
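For illustration, one hypothetical layout of such a distribution control register (the field positions, widths, and supported link count are assumptions, and C bit-field ordering is itself implementation-defined) is:

    /* Hypothetical register layout illustrating the fields described above. */
    #include <stdint.h>

    typedef union {
        uint32_t raw;
        struct {
            uint32_t req_en     : 1;   /* distribute request packets */
            uint32_t resp_en    : 1;   /* distribute response packets */
            uint32_t probe_en   : 1;   /* distribute probe packets */
            uint32_t dest_node  : 5;   /* destination node to distribute toward */
            uint32_t dest_links : 8;   /* bit vector, one bit per link/sublink */
            uint32_t reserved   : 16;
        } f;
    } dist_ctrl_reg_t;

Under these assumed field positions, for example, distributing request and response packets destined for node 1 over links 0 and 2 would set req_en and resp_en, program dest_node to 1, and program dest_links to 0b101.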
Turning next to
The node controller 42 may determine if distribution is enabled (decision block 60). The decision may be applied on a global basis (e.g. enabled or not enabled), or may be applied on a packet-type basis (e.g. the embodiment of
In this embodiment, the distribution of packet traffic is performed at the time the SRQ 44 is written for a packet. Subsequent scheduling may be performed as normal. For example, each packet may be scheduled based on buffer availability at the receiver on the destination link assigned to that packet (for the coupon based scheme), along with any other scheduling constraints that may exist in various embodiments.
In another embodiment, packet distribution may be performed at the time the packet is scheduled for transmission (e.g. an embodiment is illustrated in
In
In this embodiment, the scheduler control unit 46 may determine an initial destination link for each packet (from any source, local or external, in this embodiment). The scheduler control unit 46 may write an indication of the initial destination link to the SRQ 44 along with other packet-related data. Scheduling may be performed based on this initial destination link as well (e.g. buffer readiness, based on the coupon scheme, etc.). In response to scheduling the packet, the packet data may be provided to the distribution control unit 52. The distribution control unit 52 may override the initial destination link, for some packets, based on the distribution control register 50 and the distribution control algorithm. The distribution control unit 52 may provide an indication of the destination link (either the new destination link or the initial destination link, if no new destination link is provided).
In this embodiment, packets from any source may be distributed. Additionally, in some embodiments, the distribution may be more accurate since the distribution occurs at packet scheduling time (as the packets are being provided to their destinations) and thus the traffic usage data may be more accurate.
In this embodiment, distribution may be keyed to the destination link rather than the destination node. Other embodiments may still associate distribution with a defined destination node. An embodiment of the distribution control register 50 is shown in
If distribution over two or more different subsets of links is to be supported, more than one destination link field may be included in the distribution control register 50, as illustrated in
Turning next to
The node controller 42 may map the packet to a destination link using the routing table 48 (block 80). The node controller 42 may write an indication of the destination link and other packet data to the SRQ 44 (block 82).
Turning next to
If distribution is not enabled (decision block 90, “no” leg), the node controller 42 may cause the packet to be transmitted on the initial destination link based on the routing table output (block 92), as read from the SRQ 44. If distribution is enabled (decision block 90, “yes” leg), but the destination link of the packet (as read from the SRQ 44) does not match a destination link programmed into the distribution control register 50 (decision block 94, “no” leg), the node controller 42 may cause the packet to be transmitted on the initial destination link (block 92) and may write the SRQ 44 (block 70). Otherwise (decision blocks 90 and 94, “yes” legs), the node controller 42 may assign a new destination link based on the distribution control algorithm (block 96). The distribution control algorithm selects a link from the subset of links over which packet traffic is being distributed, dependent on the traffic that has been transmitted on the links in the subset. One embodiment is illustrated in
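The scheduling-time decision (decision blocks 90 and 94 and block 96) may be sketched as follows, with hypothetical names; dest_links is the destination link field of the distribution control register, initial_link is the routing table output read from the SRQ 44, and distributed_link is the output of the distribution control algorithm sketched below:

    /* Sketch of the scheduling-time override decision. */
    #include <stdbool.h>

    typedef unsigned link_id_t;

    static link_id_t pick_destination_link(bool dist_enabled, unsigned dest_links,
                                           link_id_t initial_link,
                                           link_id_t distributed_link)
    {
        if (!dist_enabled)
            return initial_link;                     /* block 92 */
        if (((dest_links >> initial_link) & 1u) == 0u)
            return initial_link;                     /* no match: block 92 */
        return distributed_link;                     /* block 96: new link */
    }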
Turning next to
Generally, the distribution control algorithm may include maintaining one or more traffic measurement values that indicate the relative amount of traffic on the links in the subset. The traffic measurement values may take on any form that directly measures or approximates the amount of traffic. For example, the traffic measurement values may comprise a value for each link, which may comprise a byte count or a packet count. If two links are used in the subset, a single traffic measurement value could be used that is increased for traffic on one link and decreased for traffic on the other link.
For this embodiment, a traffic measurement value for each link may be maintained. The traffic measurement value may comprise an M bit counter that is initialized to zero and saturates at the maximum counter value. A packet without data (command only) may increment that counter by one. A packet with data may set the count to the max (since the data portion of the packet is substantially larger than the command portion, for block sized data, in this embodiment). For example, M may be 3, and the maximum amount may be seven.
In addition to the traffic measurement values, the node controller 42 may maintain a pointer identifying the most recently selected link (LastLinkSelected). The algorithm may include a round-robin selection among the links, excluding those that have traffic measurement values that have reached the maximum.
Thus, the node controller 42 may select the next link in the subset of links after the LastLinkSelected (rotating back to the beginning of the destination links field of the distribution control register) that has a corresponding traffic measurement value (TrafficCnt) less than the maximum value of the counter (Max) (block 100) and the selected link may be provided as the destination link, or new link (block 102). The node controller 42 may also update the LastLinkSelected pointer to indicate the selected link (block 104). The node controller 42 may update the TrafficCnt corresponding to the selected link (block 106). If all the TrafficCnts (corresponding to all the links in the subset) are at the Max (decision block 108, “yes” leg), the node controller 42 may set the TrafficCnts to the Min value (e.g. zero) (block 110).
Accordingly, the TrafficCnts represent the relative amount of traffic that has been recently transmitted on the links. Since a link having a TrafficCnt equal to Max is not selected, eventually each link will be selected enough times to reach Max. Accordingly, bandwidth should be relatively evenly consumed over the eligible links.
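Under the assumptions above (3-bit saturating counters, a round-robin LastLinkSelected pointer, and hypothetical names), the selection of blocks 100-110 might be sketched as:

    /* Sketch of one possible distribution control algorithm: a saturating
     * TrafficCnt per link, a LastLinkSelected pointer, and a round-robin scan
     * that skips links whose counter has reached Max. Names, counter width,
     * and the link count are assumptions for illustration. link_mask is the
     * (nonzero) bit vector of links in the distribution subset. */
    #include <stdbool.h>

    #define NUM_LINKS 4
    #define MAX_CNT   7u                   /* M = 3: counter saturates at 7 */

    typedef struct {
        unsigned traffic_cnt[NUM_LINKS];   /* per-link TrafficCnt */
        unsigned last_selected;            /* LastLinkSelected pointer */
    } dist_state_t;

    static unsigned select_link(dist_state_t *s, unsigned link_mask, bool has_data)
    {
        unsigned link = s->last_selected;

        /* Block 100: next link in the subset after LastLinkSelected whose
         * TrafficCnt is below Max. At least one such link exists, because the
         * counters are cleared whenever they all reach Max. */
        do {
            link = (link + 1u) % NUM_LINKS;
        } while (((link_mask >> link) & 1u) == 0u ||
                 s->traffic_cnt[link] >= MAX_CNT);

        /* Blocks 102-106: record the selection and update its counter. */
        s->last_selected = link;
        if (has_data)
            s->traffic_cnt[link] = MAX_CNT;    /* packet with data: go to Max */
        else
            s->traffic_cnt[link] += 1u;        /* command-only packet: +1 */

        /* Blocks 108-110: if every link in the subset is at Max, reset all
         * counters to the Min value (zero). */
        bool all_max = true;
        for (unsigned i = 0; i < NUM_LINKS; i++) {
            if (((link_mask >> i) & 1u) != 0u && s->traffic_cnt[i] < MAX_CNT)
                all_max = false;
        }
        if (all_max) {
            for (unsigned i = 0; i < NUM_LINKS; i++)
                s->traffic_cnt[i] = 0u;
        }

        return link;
    }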
The above-mentioned traffic measurement and selection algorithm is but one possible embodiment. For example, other embodiments may monitor traffic using similar traffic measurements, but may simply select the link indicating the least amount of traffic. Additionally, in cases in which the routing table 48 always produces the same initial destination link, the selection may favor the other links in the subset (all else being equal), since the initial destination link's buffer availability is used as part of the scheduling decision while the other links' buffer availability is not.
It is noted that, in embodiments in which the destination link is selected via the packet distribution mechanism at packet scheduling time, the readiness of the eligible links may also be factored into the selection. That is, a link that cannot currently receive the packet may not be selected.
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.