1. Technical Field
Methods and example embodiments described herein are generally directed to interconnect architecture, and more specifically, to network on chip system interconnect architecture.
2. Related Art
The number of components on a chip is rapidly growing due to increasing levels of integration, system complexity and shrinking transistor geometry. Complex System-on-Chips (SoCs) may involve a variety of components, e.g., processor cores, DSPs, hardware accelerators, memory and I/O, while Chip Multi-Processors (CMPs) may involve a large number of homogeneous processor cores, memory and I/O subsystems. In both systems, the on-chip interconnect plays a key role in providing high-performance communication between the various components. Due to scalability limitations of traditional buses and crossbar-based interconnects, Network-on-Chip (NoC) has emerged as a paradigm to interconnect a large number of components on the chip.
NoC is a global shared communication infrastructure made up of several routing nodes interconnected with each other using point-to-point physical links. Messages are injected by the source component and are routed from the source component to the destination over multiple intermediate nodes and physical links. The destination node then ejects the message and provides it to the destination component. For the remainder of the document, the terms ‘components’, ‘blocks’, ‘hosts’ or ‘cores’ will be used interchangeably to refer to the various system components which are interconnected using a NoC. The terms ‘routers’ and ‘nodes’ will also be used interchangeably. Without loss of generality, the system with multiple interconnected components will itself be referred to as a ‘multi-core system’.
There are several possible topologies in which the routers can connect to one another to create the system network. Bi-directional rings (as shown in
Packets are message transport units for intercommunication between various components. Routing involves identifying a path which is a set of routers and physical links of the network over which packets are sent from a source to a destination. Components are connected to one or multiple ports of one or multiple routers; with each such port having a unique ID. Packets carry the destination's router and port ID for use by the intermediate routers to route the packet to the destination component.
Examples of routing techniques include deterministic routing, which involves choosing the same path from A to B for every packet. This form of routing is oblivious of the state of the network and does not load balance across path diversities which might exist in the underlying network. However, such deterministic routing may be simple to implement in hardware, maintains packet ordering and may be easy to make free of network level deadlocks. Shortest path routing minimizes the latency as it reduces the number of hops from the source to destination. For this reason, the shortest path is also the lowest power path for communication between the two components. Dimension-order routing is a form of deterministic shortest path routing in 2D mesh networks.
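Dimension-order routing can be illustrated with a minimal Python sketch. The (x, y) router addressing and the function name `xy_route` are assumptions made for illustration only; the description above does not specify them.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst.

    Assumes a 2D mesh in which each router has hypothetical (x, y)
    coordinates. The packet first travels along X, then along Y.
    """
    x, y = src
    path = [(x, y)]
    # Resolve the X dimension first.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then resolve the Y dimension.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
hops = len(route) - 1  # number of physical links traversed
```

Every packet between the same pair of routers follows the same path, and the hop count equals the Manhattan distance between them, which is why this scheme is both deterministic and shortest-path.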
Source routing and routing using tables are other routing options used in NoC. Adaptive routing can dynamically change the path taken between two points on the network based on the state of the network. This form of routing may be complex to analyze and implement and is therefore rarely used in practice.
A NoC may contain multiple physical networks. Over each physical network, there may exist multiple virtual networks, wherein different message types are transmitted over different virtual networks. In this case, at each physical link or channel, there are multiple virtual channels; each virtual channel may have dedicated buffers at both end points. In any given clock cycle, only one virtual channel can transmit data on the physical channel.
NoC interconnects often employ wormhole routing, wherein a large message or packet is broken into small pieces known as flits (also referred to as flow control digits). The first flit is the head flit, which holds information about the packet's route and key message-level information along with payload data, and sets up the routing behavior for all subsequent flits associated with the message. Zero or more body flits follow the head flit, containing the remaining payload of data. The final flit is the tail flit, which in addition to containing the last payload also performs some bookkeeping to close the connection for the message. In wormhole flow control, virtual channels are often implemented.
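The division of a packet into head, body and tail flits may be sketched as follows. The dictionary layout, the `kind` and `route` field names, and the function name `packetize` are hypothetical and chosen only for this illustration.

```python
def packetize(payload: bytes, flit_bytes: int, route_info: str):
    """Split a payload into head, body and tail flits.

    Each flit is a dict with a hypothetical 'kind' field. Only the head
    flit carries the routing information.
    """
    if flit_bytes <= 0:
        raise ValueError("flit_bytes must be positive")
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)] or [b""]
    flits = []
    for idx, chunk in enumerate(chunks):
        if idx == 0:
            # A single-chunk packet yields only this flit; this sketch
            # labels it 'head' even though it also ends the packet.
            kind = "head"
        elif idx == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        flit = {"kind": kind, "data": chunk}
        if kind == "head":
            flit["route"] = route_info
        flits.append(flit)
    return flits


payload = bytes(range(40))
flits = packetize(payload, flit_bytes=16, route_info="router 5, port 1")
kinds = [f["kind"] for f in flits]
```

With a 40-byte payload and 16-byte flits, the sketch yields one head flit, one body flit and one tail flit, and concatenating the `data` fields recovers the original payload.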
The physical channels are time sliced into a number of independent logical channels called virtual channels (VCs). VCs provide multiple independent paths to route packets, however they are time-multiplexed on the physical channels. A virtual channel holds the state needed to coordinate the handling of the flits of a packet over a channel. At a minimum, this state identifies the output channel of the current node for the next hop of the route and the state of the virtual channel (idle, waiting for resources, or active). The virtual channel may also include pointers to the flits of the packet that are buffered on the current node and the number of flit buffers available on the next node.
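The per-virtual-channel state described above may be pictured as a small record. This is a sketch under assumptions: the class and field names are hypothetical, and the forwarding rule in `can_forward` is one plausible reading of how the listed state would be used, not a rule stated in the description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class VCStatus(Enum):
    IDLE = "idle"
    WAITING = "waiting for resources"
    ACTIVE = "active"


@dataclass
class VirtualChannelState:
    status: VCStatus = VCStatus.IDLE
    output_channel: Optional[int] = None   # output of this node for the next hop
    buffered_flits: List[bytes] = field(default_factory=list)
    downstream_credits: int = 0            # free flit buffers on the next node

    def can_forward(self) -> bool:
        """Assumed rule: a flit may leave only if the VC is active, a flit
        is buffered here, and the next node has a free flit buffer."""
        return (self.status is VCStatus.ACTIVE
                and bool(self.buffered_flits)
                and self.downstream_credits > 0)


vc = VirtualChannelState()
idle_ok = vc.can_forward()

vc.status = VCStatus.ACTIVE
vc.output_channel = 2
vc.buffered_flits.append(b"\x01")
vc.downstream_credits = 1
active_ok = vc.can_forward()
```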
The term “wormhole” plays on the way messages are transmitted over the channels: the output port at the next router can be so short that received data can be translated in the head flit before the full message arrives. This allows the router to quickly set up the route upon arrival of the head flit and then opt out from the rest of the conversation. Since a message is transmitted flit by flit, the message may occupy several flit buffers along its path at different routers, creating a worm-like image.
Based upon the traffic between various end points, and the routes and physical networks that are used for various messages, different physical channels of the NoC interconnect may experience different levels of load and congestion. The capacity of various physical channels of a NoC interconnect is determined by the width of the channel (number of physical wires) and the clock frequency at which it is operating. Various channels of the NoC may operate at different clock frequencies. However, all channels are equal in width or number of physical wires. This width can be determined based on the most loaded channel and the clock frequency of various channels.
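The relationship between load, clock frequency and a uniform channel width may be sketched as follows. The channel names and numeric loads are invented for illustration, and the helper names are hypothetical.

```python
def required_width(load_bits_per_s: int, clock_hz: int) -> int:
    """Smallest number of wires whose capacity (width x clock) covers the load."""
    return -(-load_bits_per_s // clock_hz)  # integer ceiling division


def uniform_width(channels) -> int:
    """Width used by every channel when all channels must be equally wide."""
    return max(required_width(load, clk) for load, clk in channels.values())


# Hypothetical (load in bits/second, clock in Hz) per channel.
channels = {
    "ch0": (64_000_000_000, 1_000_000_000),
    "ch1": (32_000_000_000, 1_000_000_000),
    "ch2": (48_000_000_000, 500_000_000),
}
per_channel = {name: required_width(load, clk)
               for name, (load, clk) in channels.items()}
common = uniform_width(channels)
```

In this made-up example the slower channel `ch2` needs 96 wires, so a uniformly sized interconnect would give all three channels 96 wires even though `ch1` needs only 32.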
Aspects of the example implementations may include a method, involving determining and/or adjusting a width for each of a plurality of channels in a network on chip (NoC) based on at least one performance objective for the each of the plurality of channels or a maximum data flow of the each of the plurality of channels, such that at least one of the plurality of channels has a different width than at least another one of the plurality of channels.
Additional aspects of the example implementations may further include a computer readable storage medium storing instructions for executing a process, involving determining and/or adjusting a width for each of a plurality of channels in a network on chip (NoC) based on at least one performance objective for the each of the plurality of channels or a maximum data flow of the each of the plurality of channels, such that at least one of the plurality of channels has a different width than at least another one of the plurality of channels.
Additional aspects of the example implementations may further include a system, which includes a width adjustment module configured to determine and/or adjust a width for each of a plurality of channels in a network on chip (NoC) based on at least one performance objective for the each of the plurality of channels or a maximum data flow of the each of the plurality of channels, such that at least one of the plurality of channels has a different width than at least another one of the plurality of channels.
a) and 1(b) illustrate examples of Bidirectional ring and 2D Mesh NoC Topologies.
a) and 3(b) illustrates an example of a ring and 2D mesh NoC with asymmetric channel widths/sizes as indicated by the link widths, in accordance with an example implementation.
Complex traffic profiles in a SoC can create uneven load on various channels of the interconnect that connects various components of the SoC. Example embodiments described herein are based on the concept of constructing interconnect with heterogeneous channel capacities (number of wires) for a specified inter-block communication pattern in the system. An example process of the automatic construction of the NoC interconnect is also disclosed.
The load on various channels of the NoC interconnect depends upon the rate at which various components are sending messages, the topology of the NoC interconnect, how various components are connected to the NoC nodes, and the path various messages are taking in the NoC. Channels may be uniformly sized in number of wires across the entire NoC to avoid the reformatting of messages within the NoC nodes as they travel over various channels. In such cases, to avoid congestion, all channels may be sized based on the most loaded channel in the NoC. Load balancing of channels can be performed by routing messages over less loaded paths, which reduces the non-uniform loading of various channels and therefore the maximum load. However, there is limited flexibility in choosing different paths. Route paths can have a variety of restrictions such as using the shortest path, using minimal turns, or lack of path diversity between various components. Therefore, in most SoCs, channels remain non-uniformly loaded, and using the highest channel load to determine the global NoC channel width leads to increased area, power and interconnect cost.
Unlike related art NoC interconnects that use a homogeneous channel width in number of wires, example implementations disclosed herein are directed to an interconnect design in which various channels may have a different number of wires, leading to non-uniform channel width and/or bandwidth capacities. The channel bandwidth requirement can be determined by computing the maximum data bandwidth that is expected on the channel, based on the traffic profile. If some channels have different widths based on their bandwidth requirements, various nodes within the NoC may have to reformat messages as messages traverse between two channels of different widths. A single message flit may need to be sub-divided into multiple smaller flits, or multiple message flits may need to be merged together to form a larger message flit. For example, a single 128-bit flit coming in on a 128-bit input channel may have to be broken into two flits of 64 bits each as it goes from a 128-bit channel to a 64-bit channel.
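The sub-division and merging of flits can be sketched with flits modeled as Python integers. The function names are hypothetical, and the sketch assumes the wider width is an integer multiple of the narrower one and that the most significant piece is sent first; neither assumption is stated in the description.

```python
def split_flit(flit: int, in_width: int, out_width: int):
    """Break one in_width-bit flit into in_width // out_width narrower
    flits, most significant piece first."""
    if in_width % out_width != 0:
        raise ValueError("sketch assumes widths are integer multiples")
    count = in_width // out_width
    mask = (1 << out_width) - 1
    return [(flit >> (out_width * (count - 1 - i))) & mask
            for i in range(count)]


def merge_flits(pieces, piece_width: int) -> int:
    """Concatenate narrow flits (most significant first) into one wide flit."""
    wide = 0
    for piece in pieces:
        wide = (wide << piece_width) | piece
    return wide


flit_128 = 0x0123456789ABCDEF_FEDCBA9876543210
halves = split_flit(flit_128, in_width=128, out_width=64)
restored = merge_flits(halves, piece_width=64)
```

Splitting the 128-bit flit gives two 64-bit flits, and merging them again restores the original value, which is the property a downstream node relies on when it widens the channel back to 128 bits.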
An example of such a heterogeneous channel NoC interconnect is shown in
In such a NoC with heterogeneous channel widths, the message reformatting that occurs within the NoC nodes is kept transparent from the end components. Thus, if the destination host is expecting messages of 128-bit flits, then the NoC is maintained such that after several segmentation and reassembly operations along the route, the message is delivered as 128-bit flits to the final destination. Therefore, the channels are sized so that all end-to-end channel widths from the end host perspective remain consistent with the original homogeneous interconnect. For example, between a pair of 128-bit transmitting and receiving hosts, the interconnect may reduce the channel to 64 bits or smaller along the route, and increase it back to 128 bits at the egress transmit channel of the router at the final destination host.
A few examples of such consistent channel width conversion are shown in
The end-to-end channel width consistency is maintained even when source and destination hosts have different transmit and receive channel widths. For example, when a host with a 64-bit transmit channel sends a message to a host with a 128-bit receive channel, the message is delivered as 128-bit flits, even if the message is up- and down-converted along the route.
The width of various channels in the NoC may be determined not only by the bandwidth requirements of all messages traversing over the channel, but also by the latency requirements. Making a channel wider, thus allocating higher channel bandwidth, may not necessarily increase the throughput if channels are already wider than average data rate needed. However, making the channel wider may reduce the latency under a non-uniform traffic distribution. The minimum channel width may thereby be determined as at least equal to the data throughput of all flows that go over the channel.
To determine the minimum width of a channel, all source and destination pairs are listed whose messages are transmitted over the channel. The user may provide the data rates of all these messages to allow the system to size the channel by adding the data rates of all messages. If the data rates of the messages are not provided, and the data transmit and receive rates of all components are known, alternate implementations may be utilized to determine the minimum channel width (or maximum sustained data rate) in a complex system with complex traffic profile.
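When the data rates are provided, the summation over all flows that cross a channel may be sketched as follows. The channel names, routes and rates are invented for illustration, and `channel_loads` is a hypothetical helper name.

```python
from collections import defaultdict


def channel_loads(flows):
    """Sum the data rate of every flow over each channel on its route.

    `flows` is a list of (rate, route) pairs, where `route` is a list of
    hypothetical channel names and `rate` is in bits per cycle.
    """
    loads = defaultdict(int)
    for rate, route in flows:
        for channel in route:
            loads[channel] += rate
    return dict(loads)


flows = [
    (32, ["r0->r1", "r1->r2"]),  # flow crossing two channels
    (16, ["r1->r2"]),
    (64, ["r0->r1"]),
]
loads = channel_loads(flows)
```

Each resulting load is a lower bound on the width of the corresponding channel in bits per cycle, so in this example the two channels could be sized differently rather than both at the larger value.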
For example, consider a shared-memory computer environment with four processors and two memories, in which any processor can talk to any memory. Assume that the peak transmit rate of each processor is 64-bit/second and the peak receive rate of each memory is 64-bit/second. In such an environment, a processor can send some fraction of the data to one memory and the remaining fraction to the other, or can entirely communicate with just one memory. Depending upon the communication pattern, a processor can transmit at a 64-bit/second rate.
Therefore, with four processors a NoC with 256-bit/sec peak channel throughput may be designed. However, in this example, the aggregate rate of communication can never exceed 128-bit/second, which is the aggregate receive capacity of the two memories. The load on various channels also depends upon the interconnect topology, routes, and the connection locations of the various components.
Assume the four processors and two memories are connected using a bi-directional ring, as shown in
In a more complex system and communication pattern, it may become challenging to determine the peak data rate on various channels, from which the minimum needed channel width may be computed. Described below is a process which can be used to determine the peak data rate at any given channel of the NoC interconnect, in accordance with an example implementation.
All pairs of components which send messages over a channel are used to construct a directed flow or bipartite graph which consists of all source components on the left hand side and the destination components on the right hand side. Then, the source and destination components are connected with directional edges. The capacity of an edge is the maximum rate at which the source component can send data to the destination component along this route. The capacity cannot exceed the peak transmit rate of the source or the peak receive rate of the destination. Subsequently, two more components are added to the graph, one at the far left as S and one at the far right as D. The leftmost component S is then connected to the list of source components on the left with directed edges, one for each source component. The capacity of these edges is the peak transmit rate of the source components. The list of destination components on the right is then connected to the rightmost component D with directed edges, one from each destination component. The capacity of these edges is the peak receive rate of the destination components.
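The graph construction above, followed by a maximum flow computation, may be sketched in Python. The sketch uses the Edmonds-Karp algorithm as one standard maximum flow algorithm; the description does not name a specific one. The helper names, the node labels "S" and "D", and the choice of min(transmit rate, receive rate) as the capacity of each source-to-destination edge are assumptions made for illustration.

```python
from collections import deque


def build_flow_graph(pairs, tx_rate, rx_rate):
    """Build S -> sources -> destinations -> D as a dict of dicts."""
    graph = {"S": {}, "D": {}}
    for src, dst in pairs:
        graph["S"][src] = tx_rate[src]
        # Assumed edge capacity: bounded by both end points.
        graph.setdefault(src, {})[dst] = min(tx_rate[src], rx_rate[dst])
        graph.setdefault(dst, {})["D"] = rx_rate[dst]
    return graph


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph."""
    # Residual graph: copy capacities and add zero-capacity reverse edges.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck on the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Four processors and two memories, all with a peak rate of 64.
procs = ["P0", "P1", "P2", "P3"]
mems = ["M0", "M1"]
tx = {p: 64 for p in procs}
rx = {m: 64 for m in mems}
pairs = [(p, m) for p in procs for m in mems]
peak = max_flow(build_flow_graph(pairs, tx, rx), "S", "D")
```

For the four-processor, two-memory example, the computed value is 128, matching the aggregate receive capacity of the two memories rather than the 256 obtained by adding the processor transmit rates.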
Once such a directed flow graph is constructed for the given channel, the flow diagram of
In an interconnect that contains both virtual and physical channels, different virtual channels can be sized differently, while the physical channel width will be the same as the width of the widest virtual channel. An example implementation of such a system is shown in
For example, if there are two virtual channels on a physical channel, and each virtual channel needs a minimum of 64 bits/cycle of bandwidth, then the virtual channels cannot be sized at 64 bits each. If they are sized as such, then each of the virtual channels will receive only 32 bits/cycle of bandwidth, assuming that they share the physical channel equally. To ensure that both virtual channels receive at least 64 bits/cycle of data bandwidth, they need to be sized at 128 bits each (assuming that they share the physical channel equally).
If there are large numbers of virtual channels sharing a physical channel, and each virtual channel has a different minimum bandwidth requirement, then it is non-trivial to size the virtual channels so that all of them meet their bandwidth requirements. In an example implementation, the problem is mapped to a standard linear optimization or linear programming problem. We can then utilize a standard linear programming solution to derive the virtual channel widths in such scenarios.
For example, assume that n virtual channels (VC1, VC2, . . . VCn) are sharing a physical channel, and their minimum bandwidth requirements are (B1, B2, . . . Bn). Thus, the total bandwidth requirement at the physical channel is ΣBi for i=1 to n. Let si indicate the width of VCi. Clearly, for all i=1 to n, si≧Bi. The width of the physical channel is max(for i=1 to n, si). Assuming that VCi wins arbitration a fraction fi of the time, the bandwidth received by VCi is fi×si, which must be ≧Bi, the required bandwidth. Since all VCs share the physical channel, Σfi≦1. Thus the following sets of linear constraints must be satisfied at all times:
si≧Bi for all i=1 to n.
fi×si≧Bi for all i=1 to n.
0≦fi≦1 for all i=1 to n.
Σfi≦1 for i=1 to n.
The buffer cost of the channel with virtual channel widths si is Σsi. If the design goal is to minimize the buffer cost, then the objective function is to minimize Σsi. If the design goal is to minimize the number of wires, then the objective function max(for i=1 to n, si) can be minimized, which is the physical channel width. If the goals are different, then alternative objective functions may be constructed. At this point, a standard linear programming algorithm can be used to solve the problem and get the values of si, the width of various virtual channels that meets the constraints and minimizes/maximizes the objective function.
An example implementation therefore includes mapping the virtual and physical channel sizing problem to the standard linear programming optimization method, by listing constraint inequalities and constructing an objective function based upon the optimization goal.
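The constraints and the two objective functions above can be explored with a short Python sketch. Because fi and si are both unknowns, this sketch does not call a linear programming solver. It instead checks candidate solutions against the listed constraints and uses two closed-form candidates that are not taken from the description: equal widths equal to ΣBi for the wire-count objective, and arbitration fractions proportional to the square root of Bi for the buffer-cost objective (obtained from a Lagrange-multiplier argument on real-valued widths). All function names are hypothetical.

```python
import math


def satisfies_constraints(widths, fractions, required, tol=1e-9):
    """Check the four constraint families listed in the text."""
    return (
        all(s >= b - tol for s, b in zip(widths, required))
        and all(f * s >= b - tol
                for f, s, b in zip(fractions, widths, required))
        and all(-tol <= f <= 1 + tol for f in fractions)
        and sum(fractions) <= 1 + tol
    )


def size_for_min_wires(required):
    """Assumed candidate: every VC as wide as sum(B_i), f_i = B_i / width."""
    width = sum(required)
    return [width] * len(required), [b / width for b in required]


def size_for_min_buffers(required):
    """Assumed candidate: f_i proportional to sqrt(B_i), s_i = B_i / f_i."""
    root_sum = sum(math.sqrt(b) for b in required)
    fractions = [math.sqrt(b) / root_sum for b in required]
    widths = [b / f for b, f in zip(required, fractions)]
    return widths, fractions


# Two VCs needing 64 bits/cycle each, as in the earlier example.
equal_widths, equal_fracs = size_for_min_wires([64, 64])

# Unequal requirements expose the trade-off between the two objectives.
wire_widths, wire_fracs = size_for_min_wires([16, 64])
buf_widths, buf_fracs = size_for_min_buffers([16, 64])
```

For two virtual channels requiring 64 bits/cycle each, both candidates give 128-bit virtual channels with equal arbitration shares, which agrees with the two-channel example. For requirements of 16 and 64, the wire-oriented candidate uses an 80-bit physical channel with 160 bits of total buffer width, while the buffer-oriented candidate uses about 144 bits of total buffer width at the cost of a 96-bit physical channel.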
The server 905 may also be connected to an external storage 950, which can contain removable storage such as a portable hard drive, optical media (CD or DVD), disk media or any other medium from which a computer can read executable code. The server may also be connected to an output device 955, such as a display to output data and other information to a user, as well as request additional information from a user. The connections from the server 905 to the user interface 940, the operator interface 945, the external storage 950, and the output device 955 may be via wireless protocols, such as the 802.11 standards, Bluetooth® or cellular protocols, or via physical transmission media, such as cables or fiber optics. The output device 955 may therefore further act as an input device for interacting with a user.
The processor 910 may execute one or more modules. The width adjustment module 911 is configured to determine and/or adjust a width for each of a plurality of channels in a network on chip (NoC) based on at least one performance objective for the each of the plurality of channels or a maximum flow of the each of the plurality of channels, such that at least one of the plurality of channels has a different width than at least another one of the plurality of channels. The width adjustment module 911 may be further configured to determine the maximum flow of the each of the plurality of channels from an application of a maximum flow algorithm on a graph of data traffic of the plurality of channels. The width adjustment module 911 may be further configured to determine the width of each of the plurality of channels by applying linear programming to determine the widths that meet the at least one performance objective while minimizing at least one specified cost function, and to apply linear programming by constructing a list of constraints for meeting virtual and physical channel performance requirements. The width adjustment module 911 may also be configured to determine at least one objective function for each of the at least one performance objective based on the list of constraints.
The message reformatter module 912 may be configured to construct a message reformatter between connected ones of the plurality of channels having differing widths, and to adjust flits of a message between the connected ones of the plurality of channels having differing widths. The message reformatter module 912 may also be configured to construct the message reformatters such that all source and destination end host pairs of the NoC maintain end-to-end message size and message format consistency.
Furthermore, some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to most effectively convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In the example embodiments, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result.
Moreover, other implementations of the example embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the example embodiments disclosed herein. Various aspects and/or components of the described example embodiments may be used singly or in any combination. It is intended that the specification and examples be considered as examples, with a true scope and spirit of the embodiments being indicated by the following claims.
Number | Date | Country | |
---|---|---|---|
20140098683 A1 | Apr 2014 | US |