The present disclosure generally relates to the field of electronics. More particularly, an embodiment of the invention relates to techniques for provision of adaptive packet deflection to achieve fair, low-cost, and/or energy-efficient Quality of Service (QoS) in Network-on-Chip (NoC) devices.
Some current interconnection networks are used to connect many computing components, such as many cores in Chip Multi Processors (CMPs) and many nodes in clustered systems. Some Network-on-Chip (NoC) prototypes with high core counts show that NoCs consume a substantial portion of overall system power. Moreover, with such systems, there can be a diverse set of applications running simultaneously on multiple cores/nodes, and the interconnect acts as a shared medium, servicing requests from these cores. As a result, at any given instance, there may exist multiple packet classes (data and control) that belong to multiple applications, originating from different cores/nodes. Each of these packets can have different Quality-of-Service (QoS) requirements. This means that the interconnect policy should be able to support multiple traffic classes, so that packets that belong to a higher priority class are served with certain QoS requirements (e.g., faster delivery time).
Current NoC-QoS approaches can be lumped into two main categories. First, QoS can be achieved by introducing additional queues to the router (e.g., extra virtual channels), assigning different classes of packets to these different queues, and serving them with different priorities. Although adding additional buffering/queues guarantees (at least to some extent) that all packets are routed through minimum paths, it significantly increases the power budget and associated costs of the interconnect. The second category of NoC-QoS approaches, on the other hand, does not necessarily add additional buffering; however, it requires major changes to the router architecture such that the router, instead of maintaining FIFO (First In First Out) queues, is able to pull any packet out of the available queues in any order and service it based on its priority level. This, therefore, increases the complexity of the interconnect and, even worse, it increases its power consumption and cost as well.
The detailed description is provided with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of various embodiments. However, some embodiments may be practiced without the specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the particular embodiments. Various aspects of embodiments of the invention may be performed using various means, such as integrated semiconductor circuits (“hardware”), computer-readable instructions organized into one or more programs (“software”) or some combination of hardware and software. For the purposes of this disclosure reference to “logic” shall mean either hardware, software, or some combination thereof.
Some embodiments improve the quality and/or performance of high-speed serial I/O channels via various techniques. For example, such techniques are used to provide adaptive packet deflection to achieve fair, low-cost, and/or energy-efficient Quality of Service (QoS) in Network-on-Chip (NoC) devices. As a result, interconnects may be able to support QoS traffic without an increase in power consumption and/or silicon area. Furthermore, some embodiments provide QoS for Network-on-Chips, without changing the router architecture, or requiring support of multiple queues, while at the same time maintaining high overall throughput.
Also, some embodiments are used for both many-core processors and systems with many nodes (such as μCluster based systems), allowing energy efficient and high performance interconnects that could fit within a target power budget. Moreover, some embodiments provide QoS support (e.g., without the queues used in some current implementations), a relatively simple router architecture (resulting in little or no increase in silicon area and no major changes to the router architecture), and/or reduced buffering area and power consumption (e.g., since the additional queues used in some current implementations are not used).
In an embodiment, NoC and/or QoS support is provided via selectively deflecting low priority packets to avoid/reduce congestion and to guarantee the timely delivery of high priority packets. By selectively deflecting packets, one embodiment does not require any additional buffering, nor does it demand changing of the router architecture; thus, it is capable of achieving QoS with minimal buffering area, and simple router architectures, in various systems such as NoC. This, in turn, reduces the interconnect cost and power consumption. Also, in an embodiment, the interconnect(s) discussed herein are implemented in accordance with PCI Express Base Specification 3.0, Revision 3.0, version 1.0 Nov. 10, 2010 and Errata for the PCI Express Base Specification Revision 3.0, Oct. 20, 2011.
Various embodiments are discussed herein with reference to a computing system component, such as the components discussed herein, e.g., with reference to
As illustrated in
In one embodiment, the system 100 can support a layered protocol scheme, which includes a physical layer, a link layer, a routing layer, a transport layer, and/or a protocol layer. The fabric 104 can further facilitate transmission of data (e.g., in form of packets) from one protocol (e.g., caching processor or caching aware memory controller) to another protocol for a point-to-point network. Also, in some embodiments, the network fabric 104 can provide communication that adheres to one or more cache coherent protocols.
Furthermore, as shown by the direction of arrows in
Also, in accordance with an embodiment, one or more of the agents 102 include one or more routing and switching logic 300 to facilitate communication between an agent (e.g., agent 102-1 shown) and one or more Input/Output (“I/O” or “IO”) devices 124 (such as Peripheral Component Interconnect Express (PCIe) I/O devices, which operate in accordance with PCI Express Base Specification 3.0, Revision 3.0, version 1.0 Nov. 10, 2010 and Errata for the PCI Express Base Specification Revision 3.0, Oct. 20, 2011) and/or other agents coupled via the fabric 104 as will be further discussed herein (e.g., with reference to
In another embodiment, the network fabric may be utilized for any System on Chip (SoC) application, utilize custom or standard interfaces, such as, ARM compliant interfaces for AMBA (Advanced Microcontroller Bus Architecture), OCP (Open Core Protocol), MIPI (Mobile Industry Processor Interface), PCI (Peripheral Component Interconnect) or PCIe (Peripheral Component Interconnect Express).
Some embodiments use a technique that enables use of heterogeneous resources, such as AXI/OCP technologies, in a PC (Personal Computer) based system such as a PCI-based system without making any changes to the IP resources themselves. Embodiments provide two very thin hardware blocks, referred to herein as a Yunit and a shim, that can be used to plug AXI/OCP IP into an auto-generated interconnect fabric to create PCI-compatible systems. In one embodiment a first (e.g., a north) interface of the Yunit connects to an adapter block that interfaces to a PCI-compatible bus such as a direct media interface (DMI) bus, a PCI bus, or a Peripheral Component Interconnect Express (PCIe) bus. A second (e.g., south) interface connects directly to a non-PC interconnect, such as an AXI/OCP interconnect. In various implementations, this bus may be an OCP bus.
In some embodiments, the Yunit implements PCI enumeration by translating PCI configuration cycles into transactions that the target IP can understand. This unit also performs address translation from re-locatable PCI addresses into fixed AXI/OCP addresses and vice versa. The Yunit may further implement an ordering mechanism to satisfy a producer-consumer model (e.g., a PCI producer-consumer model). In turn, individual IPs are connected to the interconnect via dedicated PCI shims. Each shim may implement the entire PCI header for the corresponding IP. The Yunit routes all accesses to the PCI header and the device memory space to the shim. The shim consumes all header read/write transactions and passes on other transactions to the IP. In some embodiments, the shim also implements all power management related features for the IP.
Thus, rather than being a monolithic compatibility block, embodiments that implement a Yunit take a distributed approach. Functionality that is common across all IPs, e.g., address translation and ordering, is implemented in the Yunit, while IP-specific functionality such as power management, error handling, and so forth, is implemented in the shims that are tailored to that IP.
In this way, a new IP can be added with minimal changes to the Yunit. For example, in one implementation the changes may occur by adding a new entry in an address redirection table. While the shims are IP-specific, in some implementations a large amount of the functionality (e.g., more than 90%) is common across all IPs. This enables a rapid reconfiguration of an existing shim for a new IP. Some embodiments thus also enable use of auto-generated interconnect fabrics without modification. In a point-to-point bus architecture, designing interconnect fabrics can be a challenging task. The Yunit approach described above leverages an industry ecosystem into a PCI system with minimal effort and without requiring any modifications to industry-standard tools.
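The address-translation and redirection-table behavior described above can be pictured as a base/limit table lookup. The sketch below is illustrative only: the function name, table layout, and sample addresses are assumptions for the sake of the example, not details from this disclosure.

```python
def translate_pci_to_axi(pci_addr, redirection_table):
    """Translate a re-locatable PCI address into a fixed AXI/OCP address
    via a base/limit redirection table. Each entry is a tuple of
    (pci_base, size, axi_base); adding a new IP amounts to appending
    one entry to this table."""
    for pci_base, size, axi_base in redirection_table:
        if pci_base <= pci_addr < pci_base + size:
            # Preserve the offset within the IP's window.
            return axi_base + (pci_addr - pci_base)
    raise ValueError("address not claimed by any IP")
```

Under this model, supporting a new IP is a single new table entry, which matches the minimal-change property described above.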
As shown in
Furthermore, one implementation (such as shown in
Some current implementations of NoCs in CMPs or System Area Networks (SANs) have various topologies, including mesh, torus, and irregular mesh topologies. Such topologies usually feature a relatively large node connection degree (e.g., 4 in a torus topology). As a result, a flow or a packet can potentially choose one of multiple paths between a given source-destination pair. In adaptive routing, when a packet encounters a faulty or congested path, it can select another bypassing path, even if it is longer in some cases. This allows for a balance of network traffic, and can potentially improve throughput and latency. However, current adaptive routing approaches generally treat all packets equally, without consideration for QoS support for traffic with different service levels (priorities).
As mentioned before, an NoC can be used as a shared medium connecting and servicing all cores/nodes on the chip. This means that at any point in time, there can be multiple messages of different node origins and different types being communicated. For example, some messages can be related to control signaling, which has higher priority than other messages. Further, different applications can have different service level requirements (real time vs. best effort). Without loss of generality, some embodiments assume there are N classes of traffic, with priority 1 being the highest priority; priority 1 packets should be delivered in a very timely fashion.
Some current implementations provide QoS support for different classes by having separate queues in the router and serving, according to the service level agreement, different queues based on priority. For example, a higher priority queue receives more serving time and is able to preempt lower priority packets. One major reason why this class of methods introduces additional separate queues is the fact that traditional router architectures are not capable of fetching packets from a single FIFO queue in an out-of-order fashion; hence, the additional dedicated per-class queues are provided. The drawback of this category of techniques is that multiple-queue support usually increases the total required buffering, which in turn increases both the area (e.g., for additional queues and any supporting logic) and the power consumption of the router (e.g., to operate the additional logic for the queues and their supporting logic). With the increasing number of nodes/cores in the system, this would become a severe problem. Some studies have also shown that in general the interconnect load can be relatively low, so larger buffering may waste energy and area.
Moreover, another class of methods addresses the problem of the router not being able to fetch packets from the queues in an out-of-order fashion by foregoing the traditional, simple router architecture and introducing a sophisticated one that is capable of pulling packets out of a given queue in no particular order (i.e., without respecting their FIFO order). Although these approaches do not introduce additional buffering, they still need a complex router design that by itself consumes a significant amount of power and increases the interconnect cost and area. By contrast, some embodiments provide for interconnect QoS support without requiring multiple queues or changes to the router architecture.
In an embodiment, when the network is lightly loaded, all packets are routed through minimal paths to achieve minimal routing latency and energy consumption. When the utilization at certain ports increases (e.g., when compared with threshold utilization value(s), which can be based on information detected by one or more sensors proximate to the ports), the logic 304 will selectively deflect one or more lower priority packets to other ports to avoid or reduce further congestion, even if the other ports are not on the minimal path to the destination. This approach in turn reduces or avoids further congestion for the higher priority traffic and facilitates its timely delivery. Moreover, since this approach balances the load on different ports, higher overall network throughput can be achieved as well. To avoid live-locks (i.e., where a packet is deflected repeatedly), the priority of a packet being deflected is gradually increased with every deflection. This in turn can guarantee that even a low priority packet will eventually be delivered to its destination, instead of being deflected endlessly.
Referring to
(1) utilization of the target port—the higher the utilization, the higher the deflection probability (the reason for this is to prevent saturation and congestion for the target port, and to ensure the timely delivery of the higher priority packets); and/or
(2) priority level of the packet—the lower priority packet is more likely to be deflected (while, on the other hand, a packet with higher priority is less likely to be deflected).
Accordingly, a probability-based deflection mechanism is used to provide for QoS in some embodiments. For example, the deflection probability P is calculated using the following equation:
P = a × (targetPortUtilization − utilizationThreshold) × (n/N)
where “a” is a scaling factor (can be experimentally decided), “n” is the priority of the packet [1, 2, 3, . . . ], with 1 being the highest priority, and “N” is the total number of traffic classes.
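As a sketch, the equation above can be implemented as follows. The clamping of P to the [0, 1] range, the randomized coin-flip decision, and the function and parameter names are illustrative assumptions, not requirements of this disclosure.

```python
import random

def deflection_probability(port_utilization, utilization_threshold,
                           priority, num_classes, scaling_factor=1.0):
    """Compute P = a * (targetPortUtilization - utilizationThreshold) * (n/N).

    priority (n): 1 is the highest; higher numbers are lower priority
    and therefore more likely to be deflected.
    """
    p = (scaling_factor
         * (port_utilization - utilization_threshold)
         * (priority / num_classes))
    # Clamp to a valid probability; below the threshold there is no deflection.
    return max(0.0, min(1.0, p))

def should_deflect(port_utilization, utilization_threshold,
                   priority, num_classes, scaling_factor=1.0):
    """Randomized deflection decision driven by the probability above."""
    p = deflection_probability(port_utilization, utilization_threshold,
                               priority, num_classes, scaling_factor)
    return random.random() < p
```

For example, with a = 1, a priority-2 packet out of N = 4 classes targeting a port at 90% utilization against a 50% threshold would be deflected with probability 0.2, while the same port below the threshold yields probability 0.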
Using this equation, a lower priority packet has a higher chance of being deflected when the utilization value of the target port is higher than a threshold value. When choosing an alternate port to deflect to, an alternate port with lower utilization that also does not increase the hop count to the destination is considered first in an embodiment. Another important aspect is ensuring that even lower priority packets are eventually delivered to the destination, e.g., by applying an aging mechanism. For example, every time a packet is deflected as determined at operation 408, its priority is increased by a certain value at an operation 410 such that it will be less likely to be deflected at the next hop. The size of the priority increment can be a design parameter. At an operation 412, the packet is sent to a non-target port. As shown in
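The alternate-port choice and the aging step described above can be sketched as follows. The tie-breaking rule and the names are illustrative assumptions; note that because priority 1 is the highest, "increasing" a packet's priority corresponds to decreasing its numeric priority value n.

```python
def choose_alternate_port(ports, hop_counts, utilizations, target_port):
    """Pick a deflection port: prefer ports that do not increase the hop
    count to the destination; among those, prefer the least utilized."""
    alternates = [p for p in ports if p != target_port]
    # Sort key: first, whether the port lengthens the path; second, its load.
    alternates.sort(key=lambda p: (hop_counts[p] > hop_counts[target_port],
                                   utilizations[p]))
    return alternates[0]

def age_priority(n, increment=1, highest=1):
    """Aging on deflection: raise the packet's priority (lower n = higher
    priority) so it is less likely to be deflected at the next hop."""
    return max(highest, n - increment)
```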
In one embodiment, to avoid or at least reduce the chance of live-lock, a "deflection counter" counts the number of deflections a packet encounters, and this counter value is taken into consideration when calculating the deflection probability discussed above. For example, the higher the value of the deflection counter, the lower the deflection probability. Also, even where all packets have the same priority (N=1), some embodiments still serve as an efficient and low-overhead approach to disperse congestion and achieve overall high throughput.
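One way to fold the deflection counter into the probability calculation is to damp P as the counter grows. Halving the probability per prior deflection, as in the sketch below, is an illustrative assumption; the mechanism described here requires only that a higher counter yields a lower probability.

```python
def deflection_probability_with_counter(port_utilization, utilization_threshold,
                                        priority, num_classes, deflection_count,
                                        scaling_factor=1.0):
    """Deflection probability that also accounts for how many times the
    packet has already been deflected, reducing the chance of live-lock."""
    p = (scaling_factor
         * (port_utilization - utilization_threshold)
         * (priority / num_classes))
    # Illustrative damping choice: halve the probability per prior deflection.
    p /= (2 ** deflection_count)
    return max(0.0, min(1.0, p))
```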
Furthermore, the target QoS level can be represented by assigning priority values to packets. These priority values could either have a user-level meaning or implementation (e.g., a user selected QoS level for a given application can be propagated down to the interconnect through the OS), a hardware-level meaning or implementation (e.g., the hardware gives control packets higher priority than data packets), or a combination of both.
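As an illustration of how a hardware-level rule and a user-level QoS selection might be combined into a single priority value, consider the sketch below; the specific combination rule is an assumption for the sake of the example, not part of this disclosure.

```python
def packet_priority(is_control, user_qos_level):
    """Map a packet to a priority class n (1 = highest).

    Hardware-level rule (assumed): control packets always get priority 1.
    Data packets are ordered by the user-selected QoS level propagated
    down through the OS (1 = most important user level)."""
    if is_control:
        return 1
    return 1 + max(1, user_qos_level)
```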
Also, the operations discussed with reference to
A chipset 506 also communicates with the interconnection network 504. The chipset 506 includes a graphics and memory controller hub (GMCH) 508. The GMCH 508 includes a memory controller 510 that communicates with a memory 512. The memory 512 stores data, including sequences of instructions that are executed by the CPU 502, or any other device included in the computing system 500. For example, the memory 512 stores data corresponding to an operating system (OS) 513 and/or a device driver 511 as discussed with reference to the previous figures. In an embodiment, the memory 512 and memory 140 of
Additionally, in some embodiments, one or more of the processors 502 have access to one or more caches (which can include private and/or shared caches in various embodiments) and associated cache controllers (not shown). The cache(s) can adhere to one or more cache coherent protocols. The cache(s) store data (e.g., including instructions) that are utilized by one or more components of the system 500. For example, the cache locally caches data stored in a memory 512 for faster access by the components of the processors 502. In an embodiment, the cache (that can be shared) can include a mid-level cache and/or a last level cache (LLC). Also, each processor 502 includes a level 1 (L1) cache. Various components of the processors 502 communicate with the cache directly, through a bus or interconnection network, and/or a memory controller or hub.
The GMCH 508 also includes a graphics interface 514 that communicates with a display device 516, e.g., via a graphics accelerator. In one embodiment of the invention, the graphics interface 514 can communicate with the graphics accelerator via an accelerated graphics port (AGP). In an embodiment of the invention, the display 516 (such as a flat panel display) can communicate with the graphics interface 514 through, for example, a signal converter that translates a digital representation of an image stored in a storage device such as video memory or system memory into display signals that are interpreted and displayed by the display 516. The display signals produced by the display device pass through various control devices before being interpreted by and subsequently displayed on the display 516.
A hub interface 518 allows the GMCH 508 and an input/output control hub (ICH) 520 to communicate. The ICH 520 provides an interface to I/O devices that communicate with the computing system 500. The ICH 520 communicates with a bus 522 through a peripheral bridge (or controller) 524, such as a peripheral component interconnect (PCI) bridge, a universal serial bus (USB) controller, or other types of peripheral bridges or controllers. The bridge 524 provides a data path between the CPU 502 and peripheral devices. Other types of topologies can be utilized. Also, multiple buses can communicate with the ICH 520, e.g., through multiple bridges or controllers. Moreover, other peripherals in communication with the ICH 520 include, in various embodiments of the invention, integrated drive electronics (IDE) or small computer system interface (SCSI) hard drive(s), USB port(s), a keyboard, a mouse, parallel port(s), serial port(s), floppy disk drive(s), digital output support (e.g., digital video interface (DVI)), or other devices.
The bus 522 communicates with an audio device 526, one or more disk drive(s) 528, and a network interface device 530 (which is in communication with the computer network 503). Other devices communicate via the bus 522. Also, various components (such as the network interface device 530) can communicate with the GMCH 508 in some embodiments of the invention. In an embodiment, the processor 502 and one or more components of the GMCH 508 and/or chipset 506 are combined to form a single integrated circuit chip (or be otherwise present on the same integrated circuit die).
Furthermore, the computing system 500 can include volatile and/or nonvolatile memory (or storage). For example, nonvolatile memory includes one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically EPROM (EEPROM), a disk drive (e.g., 528), a floppy disk, a compact disk ROM (CD-ROM), a digital versatile disk (DVD), flash memory, a magneto-optical disk, or other types of nonvolatile machine-readable media that are capable of storing electronic data (e.g., including instructions).
As illustrated in
In an embodiment, the processors 602 and 604 can be one of the processors 502 discussed with reference to
At least one embodiment of the invention is provided within the processors 602 and 604 or chipset 620. For example, the processors 602 and 604 and/or chipset 620 include one or more routing and switching logic 300. Other embodiments of the invention, however, can exist in other circuits, logic units, or devices within the system 600 of
The chipset 620 communicates with a bus 640 using a PtP interface circuit 641. The bus 640 has one or more devices that communicate with it, such as a bus bridge 642 and I/O devices 643. Via a bus 644, the bus bridge 642 communicates with other devices such as a keyboard/mouse 645, communication devices 646 (such as modems, network interface devices, or other communication devices that communicate through the computer network 503), an audio I/O device, and/or a data storage device 648. The data storage device 648 stores code 649 that may be executed by the processors 602 and/or 604.
In various embodiments of the invention, the operations discussed herein, e.g., with reference to
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least an implementation. The appearances of the phrase “in one embodiment” in various places in the specification may or may not be all referring to the same embodiment.
Also, in the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. In some embodiments of the invention, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements may not be in direct contact with each other, but may still cooperate or interact with each other.
Thus, although embodiments of the invention have been described in language specific to structural features and/or methodological acts, it is to be understood that claimed subject matter may not be limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.