The invention relates to an electronic device and a method for arbitrating shared resources.
Among novel system on chip SoC architectures with a multi-hop interconnect, networks on chip (NOC) proved to be scalable interconnect infrastructures, composed of routers (or switches) and network interfaces (NI, or adapters), on one or more dies (“system in a package”) or chips. However, only a few of the proposed architectures offer guaranteed services (or quality of service, QoS), such as guaranteed throughput, latency, or jitter.
One example of such an architecture is the Æthereal architecture with contention-free routing or distributed TDMA, as described by E. Rijpkema, K. Goossens, and P. Wielage, “A router architecture for networks on silicon”, In Proceedings of Progress 2001, 2nd Workshop on Embedded Systems, Veldhoven, the Netherlands, October 2001. A further example is the Nostrum architecture with hot-potato routing with containers, as shown by M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, “Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip”, In Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE), 2004. J. Liang, S. Swaminathan, and R. Tessier, “aSOC: A scalable, single-chip communications architecture”, In Proc. Int'l Conference on Parallel Architectures and Compilation Techniques, 2000, show an aSOC with a variation on distributed TDMA.
However, these networks on chip NOCs require a global notion of synchronicity to avoid the contention of packets in the network on chip NOC by scheduling packet injection. Typically, these networks on chip have been implemented in a synchronous manner (i.e. with one global clock, either 100% synchronously or mesochronously).
Many other networks on chip NOCs have been reported without time-related (throughput, latency, jitter) Quality of Service QoS. These do not require a global notion of synchronicity, so their implementation may be synchronous or asynchronous. Examples are the synchronous SPIN architecture by P. Guerrier, “Un Réseau D'Interconnexion pour Systèmes Intégrés”, PhD thesis, Université Paris VI, March 2000, an asynchronous router by Felicijan, Arteris's asynchronous NOC (www.arteris.net), and Sonics's Silicon Backplane (www.sonicsinc.com). The synchronous implementations (e.g. SPIN and Sonics) can easily implement global arbitration schemes. The asynchronous schemes (Arteris, Felicijan) do not use a global arbitration scheme.
For an implementation of quality of service QoS, i.e. guaranteed throughput and guaranteed latency, an end-to-end arbitration is required for a multi-hop interconnect such as a network on chip. These multi-hop interconnects require multiple arbiters wherein all arbiters between a master and a slave, i.e. between a requester and a responder, have to cooperate in order to enable an end-to-end arbitration. In other words, a global notion of time is required between the master and the slave. Such a global notion of time can easily be implemented within a system on chip SoC which comprises a synchronous clock. However, a system on chip cannot be implemented 100% synchronously. This has led to an approach of a globally asynchronous, locally synchronous GALS design. In “Globally-asynchronous locally-synchronous architecture for VLSI systems” by Jens Muttersbach, Series in Microelectronics, Volume 120, Hartung-Gorre Verlag Konstanz, 2001, the basic concept of the GALS architecture is described.
The general architecture of a GALS building block is shown in
To cover the diverse requirements for inter-module communication, two families of port controllers are useful, namely a poll-type and a demand-type port. A Poll-type (P-type) port issues the request for clock stretching exclusively to prevent metastability and thus ensures data correctness. The clock is influenced as little as possible. A Demand-type (D-type) port also ensures data integrity on the transfer channel but adds a feature similar to clock gating. As soon as it is enabled it stops the local clock, and it releases the clock as soon as the required transfer has taken place.
Furthermore, an implementation of the port types in an input and output variant is shown in
In
The gray shaded area marks the transparent phase of the data latches L (Ap=1). At the time the latch L opens, the receiving clock is inactive (Ai2=1) and remains inactive far longer than the propagation delay of the latch. This ensures that the events on the data lines arrive at the receiving flip-flops safely and no metastability can occur. Keeping the sending clock stopped (Ai1=1) ensures that data1 remains stable while the latches are transparent.
It is an object of the invention to provide an electronic device and a corresponding method for implementing Quality of service in the absence of a global synchronous clock.
This object is solved by an electronic device according to claim 1, a method for arbitrating shared resources according to claim 18, and the use of tokens to communicate a notion of time between arbiter units according to claim 19.
Therefore, an electronic device is provided comprising a plurality of first shared resources; and a plurality of arbiter units each for performing an arbitration for at least one of the plurality of first shared resources. The communication between the arbiter units is performed on an asynchronous basis, and the data communication between the first shared resources is performed on an asynchronous basis. Each arbiter unit is adapted for sending a first token to at least one neighboring arbiter unit, and for receiving a second token from at least one neighboring arbiter unit to implement a first global notion of time.
Hence, the proposed global arbitration scheme is scalable in the number of arbitration units, which is an advantage over the use of a synchronous communication between the arbitration units which is not scalable.
According to an aspect of the invention, the electronic device further comprises a plurality of ports and an asynchronous interconnect means, being a first shared resource, for coupling the plurality of ports. The interconnect means comprises a plurality of interconnect units each being a second shared resource and a plurality of arbiter units for performing an arbitration for at least one of the plurality of second shared resources, for sending a first token to at least one neighboring interconnect component, and for receiving a second token from at least one neighboring interconnect component to implement a second global notion of time within the interconnect means. Accordingly, the global notion of time can also be realized in the interconnect, allowing an implementation of quality of service within an asynchronous interconnect and hence between the ports.
The invention further relates to a method for arbitrating shared resources within an electronic device having a plurality of first shared resources. A plurality of arbitrations for at least one of the plurality of first shared resources is performed. The communication between arbitrations is performed on an asynchronous basis. The data communication between the first shared resources is performed on an asynchronous basis. Each arbitration comprises a step of sending a first token to at least one neighboring arbitration, and of receiving a second token from at least one neighboring arbitration to implement a first global notion of time.
The invention further relates to the use of tokens to communicate a notion of time between arbiter units for performing a plurality of arbitrations for at least one of a plurality of first shared resources in an electronic device. The communication between the arbitration units is performed on an asynchronous basis. A data communication between the first shared resources is performed on an asynchronous basis. This is advantageous as tokens usually merely communicate data and not time.
The invention is based on the idea of providing an asynchronous implementation of a distributed global arbitration scheme (e.g. memory controller and network on chip NOC arbitration scheme, communication assist and network on chip NOC arbitration scheme in a tile-based approach). A global notion of synchronicity (or arbitration scheme) is provided which can be implemented asynchronously in a distributed fashion. It can be applied to implement networks on chip NOCs (or, more generally, communication infrastructures such as hierarchical/bridged busses) with other arbitration schemes that also require a global notion of synchronicity, such as rate-controlled schemes (e.g. virtual-circuit-queued or output-queued) and deadline-based schemes. Fundamentally, the basic idea is that a network on chip NOC can implement a global notion of synchronicity (or a global schedule) by being made up of components (e.g. routers, network interfaces) that exchange tokens every logical unit of synchronization (or time step or data flow firing).
The invention is primarily directed to the case of a) an asynchronous network on chip NOC coupling IP blocks IP which operate at a multiple or divisor of the network on chip NOC synchronization rate, i.e. are demand-driven; b) an asynchronous network on chip NOC coupling IP blocks IP which do not operate at a multiple or divisor of the network on chip NOC synchronization rate, i.e. are data-driven; and c) an asynchronous network on chip NOC coupling IP blocks IP which do not operate at a multiple or divisor of the network on chip NOC synchronization rate, i.e. are event-driven.
Further aspects of the invention are described in the dependent claims.
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter and with respect to the following figures.
a-d shows a network on chip with routers R and network interfaces NI as interconnects as well as IP blocks;
The present method of providing QoS (in particular bounded latency) is based on the data-flow model underlying contention-free routing, as documented in E. Rijpkema, K. Goossens, and P. Wielage, “A router architecture for networks on silicon”, In Proceedings of Progress 2001, 2nd Workshop on Embedded Systems, Veldhoven, the Netherlands, Oct. 2001. The logical unit of synchronization can be a flit, as explained by E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, and E. Waterlander, “Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip”, In Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE), pages 350-355, March 2003. This scheme can be implemented on a synchronous basis, as explained in the cited papers, but also on an asynchronous basis according to the invention.
The global notion of time describes a situation where an (possibly every) arbiter unit is aware of the state or status of (all) other arbiter units. Therefore, if an arbiter unit is in step 3, all the other arbiter units will also be in step 3.
a) and 2(b) show block diagrams of a multi-hop interconnect IM coupling several IP blocks according to a first embodiment. The interconnect IM comprises several routers R and network interfaces NI as interconnect components or interconnect nodes for connecting the routers to the IP blocks IP.
An asynchronous implementation of a router R (or other network on chip NOC component) results, upon start-up/reset, firstly in the production of a token T on every output, i.e. on each link to other network on chip NOC components as shown in
This concept can also be used for rate-controlled and deadline-based global arbitration schemes. Note that the tokens T either contain data or are empty. Even in the absence of data they must be sent to maintain the notion of synchronicity.
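The token exchange described above can be sketched as a small behavioral simulation. This is only an illustrative model under stated assumptions, not the patented circuit: the firing rule (a node advances one step once a token is present on every input), the function names `simulate_token_exchange` and `mesh_2x2`, and the node labels `R0` to `R3` are hypothetical, and every link is assumed to be bidirectional.

```python
import random
from collections import deque


def simulate_token_exchange(links, steps, seed=0):
    """Fire ready nodes in random order; return (step counters, worst neighbor skew)."""
    rng = random.Random(seed)
    nodes = sorted({n for link in links for n in link})
    inputs = {n: [link for link in links if link[1] == n] for n in nodes}
    outputs = {n: [link for link in links if link[0] == n] for n in nodes}
    # Start-up/reset: every node produces one token on each of its outputs.
    queues = {link: deque([None]) for link in links}
    step = {n: 0 for n in nodes}
    worst_skew = 0
    while min(step.values()) < steps:
        # Assumed firing rule: a node may advance once every input holds a token.
        ready = [n for n in nodes
                 if step[n] < steps and all(queues[link] for link in inputs[n])]
        node = rng.choice(ready)  # arbitrary order stands in for unknown delays
        for link in inputs[node]:
            queues[link].popleft()  # consume one token per input
        step[node] += 1
        for link in outputs[node]:
            # None models an empty token; a data token would carry a payload.
            queues[link].append(None)
        worst_skew = max(worst_skew,
                         max(abs(step[a] - step[b]) for a, b in links))
    return step, worst_skew


def mesh_2x2():
    """Hypothetical 2x2 mesh of routers with bidirectional links."""
    pairs = [("R0", "R1"), ("R2", "R3"), ("R0", "R2"), ("R1", "R3")]
    return pairs + [(b, a) for a, b in pairs]
```

Calling `simulate_token_exchange(mesh_2x2(), 20)` lets every node reach step 20 regardless of the firing order. In this model, two directly connected nodes never differ by more than one step, because a node cannot fire again before its neighbor has returned a token.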
Now the implementation of Quality of service for an asynchronous interconnect IM is described. The network on chip NOC components will advance as slowly as the slowest component, which determines the synchronization rate of the network on chip NOC as a whole. The number of iterations per second is related to the “actual clock speed.” For example, a synchronization step may correspond to three clock cycles. The fact that the synchronization rate is generated internally in the network on chip NOC, i.e. by the slowest component, and not imposed by an external known clock (as is the case for fully synchronous networks on chip NOCs) is not problematic, and does not invalidate the concept of QoS, because all asynchronous components within the network are designed with a certain target frequency of operation in mind.
As an illustration, the target frequency may be 166 M synchronizations/sec or 166 M flits/sec, where a flit may be 3 words of 32 bits each. By taking an appropriate margin (or “over-designing”), of 20% for instance, the components are designed to run at 200 M synchronizations/sec or 200 M flits/sec, so that even the slowest component will surely run faster than the intended 166 M synchronizations/sec or 500 M words/sec. This leads to a guaranteed throughput of at least 166 M synchronizations/sec or 500 M words/sec, and a potentially faster operating network on chip NOC. The actual margin will depend on the accuracy of chip processing, worst-case operating conditions, and so on. This line of reasoning is accepted equally for synchronous and asynchronous modules/ICs.
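The figures in this example can be checked with a few lines of arithmetic. The sketch below only restates the numbers given above (166 M flits/sec target, 3 words of 32 bits per flit, 20% margin); the function names are hypothetical. The text rounds the results to 200 M flits/sec and 500 M words/sec.

```python
WORDS_PER_FLIT = 3   # from the example: a flit may be 3 words
BITS_PER_WORD = 32   # from the example: 32 bits per word


def design_rate(target_rate, margin):
    """Rate the components are designed for, given a relative design margin."""
    return target_rate * (1.0 + margin)


def words_per_second(flits_per_second):
    """Word throughput corresponding to a flit (synchronization) rate."""
    return flits_per_second * WORDS_PER_FLIT


def bits_per_second(flits_per_second):
    """Raw bit throughput corresponding to a flit (synchronization) rate."""
    return words_per_second(flits_per_second) * BITS_PER_WORD


target = 166e6                      # guaranteed synchronizations (flits) per second
designed = design_rate(target, 0.20)
print(f"designed rate:    {designed / 1e6:.1f} M flits/sec")
print(f"guaranteed words: {words_per_second(target) / 1e6:.0f} M words/sec")
print(f"guaranteed bits:  {bits_per_second(target) / 1e9:.3f} Gbit/sec")
```

The guaranteed throughput is always computed from the target rate, not from the designed rate; the margin only makes it more likely that the slowest component still meets the target.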
a-d shows a network on chip with routers R and network interfaces NI as interconnects as well as IP blocks IP coupled to the respective network interfaces NI according to a second embodiment. The IP blocks may operate at multiple rates (or divisor rates) using different token rates. Accordingly, Quality of Service (QoS) of an asynchronous multi-hop interconnect IM with the IP blocks IP running at multiples or divisors of the network on chip NOC synchronization rate is shown. In
In both cases, the solution is only applicable for IP blocks running at multiples or divisors of the network on chip NOC frequency. Moreover, in the synchronous case, it is no longer feasible to have a single synchronous clock serving all IP blocks attached to a network on chip NOC.
In the synchronous case, the use of multiple independent clocks for IP and network on chip NOC (which operates on one clock) relies on data synchronization, i.e. the use of two flip-flops in series to cross from one clock domain (of the IP) to another (that of the network on chip NOC), or vice versa. This can be referred to as data-driven synchronization. Although such a solution will work, it is not optimal because errors may occur when sampling data coming from another clock domain. This situation gets worse as both frequencies increase.
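As a purely behavioral illustration of the two flip-flops in series, the following sketch models the latency of such a synchronizer. It is an assumption-laden toy model: the class name `TwoFlopSynchronizer` is hypothetical, and metastability itself (the source of the sampling errors mentioned above) is not modeled, only the two-edge delay.

```python
class TwoFlopSynchronizer:
    """Two flip-flops in series, clocked by the receiving clock domain."""

    def __init__(self):
        self.ff1 = 0  # first stage, samples the signal from the other domain
        self.ff2 = 0  # second stage, drives the receiving logic

    def clock(self, async_in):
        """Apply one receiving-clock edge and return the synchronized output."""
        # Both stages update on the same edge: ff2 takes the old ff1 value.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2


sync = TwoFlopSynchronizer()
outputs = [sync.clock(bit) for bit in [1, 1, 0, 0]]
print(outputs)  # [0, 1, 1, 0]
```

A change at the input becomes visible at the output on the second receiving-clock edge after it was first sampled. Adding further stages in the same way increases this latency by one edge per stage.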
In the asynchronous case, the synchronization of multiple independent clocks for the IP and network on chip NOC which operates with a logical notion of synchronicity, can be solved by demand-driven synchronization, data synchronization or by event-driven synchronization. The first solution cannot cope with all clock ratios, variable clocks, etc. The second solution introduces the potential for incorrect data. The third solution has neither problem.
In the case of data-driven synchronization, every module samples each of its communication lines to other modules when it advances its clock. This can be done with the double flip-flop scheme. Potential problems with incorrect data samples are introduced. In particular, there is a probability that a bit which is sampled using the two flip-flops is incorrect. By using more flip-flops this probability can be reduced, at the cost of an increased latency. Note that this error probability exists for every data-driven port/link in the system, and that these probabilities add up, in the sense that errors do not cancel each other out or compensate for each other.
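The accumulation of error probabilities can be illustrated numerically. The sketch below assumes, for illustration only, that every port has the same per-sample error probability `p` and that errors on different ports are independent; neither assumption nor the example values come from the text. Under these assumptions the probability of at least one error is 1 - (1 - p)^n, which is close to n·p for small p.

```python
import math


def any_port_error_probability(p, ports):
    """P(at least one of `ports` independent samples is wrong), each with probability p."""
    # 1 - (1 - p)**ports, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(ports * math.log1p(-p))


p = 1e-9  # hypothetical per-port, per-sample error probability
for ports in (1, 10, 100):
    print(ports, any_port_error_probability(p, ports))
```

For small `p` the result grows almost linearly with the number of data-driven ports, which is the sense in which the probabilities add up.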
A demand-driven synchronization is shown in
The network interface NI comprises an exclusive OR unit XOR, connected to a mutual exclusion unit mutex, which in turn is connected to a toggle unit TU. The output of the toggle unit TU is connected to a logic unit LU and constitutes the response signal ip2ni_ack. A feedback loop with a delay line and inverter DLI is coupled to the mutual exclusion unit mutex. The two-input mutual exclusion element mutex is a standard asynchronous building block.
The response part of the network interface NI is arranged in a corresponding manner without the delay and inverter DLI.
Basically, whenever an external event from the IP arrives at the NI a state element is toggled to store this information (that the IP has communicated) so that it can be used by the logic block. The event is then acknowledged by the signal ip2ni_ack to the IP block IP. The acknowledge to the IP block is in the critical path and must be as quick as possible. For this reason the toggle element TU lowers the request line (going into the mutual exclusion element), immediately, without requiring any interaction from the potentially very slow IP block. The IP block can then respond to the acknowledge at leisure. The logic unit LU uses the information that the request line ip2ni_valid has been high, e.g. to read out the request data.
It should be noted that the above mentioned operations normally do not stop the internally generated clock of the NI at all.
The router comprises an exclusive OR unit XOR, connected to a mutual exclusion unit mutex, which in turn is connected to a toggle unit TU. The output of the toggle unit TU is connected to a synchronous router core NSR. A feedback loop with a delay line and inverter DLI is coupled to the mutual exclusion unit mutex. The two-input mutual exclusion element mutex is a standard asynchronous building block.
In the upper part
Now the interaction between a network on chip NOC (synchronous or asynchronous) and the IP blocks is considered. The QoS (e.g. guaranteed latency) as implemented by the network on chip NOC will only stretch from the master mNI to the slave mNI. If the master (slave) and network on chip NOC (i.e. master (slave, resp) NI) operate synchronously, i.e. within the same or derived clock domain (i.e. without clock domain crossing), then the QoS guarantees will extend from the master to the slave. Similarly, if the network on chip NOC is asynchronous, and the master (slave) synchronizes every (fixed multiple) time step with the master (slave, resp) NI, the QoS will extend from the master MIP to the slave SIP. Accordingly, this will correspond to an asynchronous (multi-rate SDF) situation, i.e. a demand-driven synchronization.
In
In
Here, an S-type port is used for the output and input port controllers OPCU, IPCU of a locally synchronous island LSM1, LSM2 that runs at a clock that cannot be stopped. Such a clock is typically an externally generated clock. Such a locally synchronous island LSM1, LSM2 does not have a pausable clock generator PCG. The locally synchronous island LSM1, LSM2 can enable the S-type port (by toggling the En signal) to perform a data communication. When the signal Ta toggles in turn, the data communication has been performed. The implementation of an S-type port is basically a free-running P-type port, as the S-type port does not interfere with any clock. A flip-flop FF is used to make the signal Ta synchronous to the LSM clock signal. Therefore, instead of the clock synchronization employed by the P-type and D-type ports, data synchronization is employed.
The response path works in a similar way. The request and response paths are implemented in this way to ensure that the NI is pausable (i.e. its local clock can be stopped), but for a short time only. Note that only the NI is stopped; the clocks of any attached routers are not stopped, and only their demand-driven handshakes may take a little longer. If an NI that is stopped for a short time is attached to a fast router (e.g. due to process variation or temperature differences), the momentary stalling of the NI may be compensated for by the router. In this way, a distributed asynchronous network on chip NOC can cope better with pausing than a globally clocked synchronous network, where any delay incurred due to a stalled NI can no longer be made up. This affects the latency only, not the throughput, which is always limited by the slowest feedback loop.
If we consider the delays of the clock due to incoming events as errors, then, in contrast to the data-driven synchronization case, described above, these errors do not add up. That is, if multiple NIs are delayed at the same time, then the network on chip NOC as a whole will be delayed only by the worst of these delays, not the sum of the delays. This is an advantage of the event-driven synchronization scheme over the data-driven scheme.
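The difference between the worst delay and the sum of the delays can be made concrete with a minimal sketch. The function names and the example delay values are hypothetical, and the model simply takes the statement above at face value: NIs that are delayed during the same synchronization step stall the network for the largest of their delays.

```python
def network_step_delay(ni_delays):
    """Delay of the NOC as a whole when several NIs are delayed in the same step."""
    return max(ni_delays, default=0.0)


def summed_delay(ni_delays):
    """What the delay would be if the individual delays accumulated instead."""
    return sum(ni_delays)


delays_ns = [2.0, 5.0, 1.0]  # hypothetical clock delays of three NIs, in ns
print(network_step_delay(delays_ns))  # 5.0
print(summed_delay(delays_ns))        # 8.0
```

With no delayed NI the step delay is zero, and adding further NIs with smaller delays does not change the result.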
If we over-dimension the NI speed, for example by 5%, then the probability of failure for a single clock period is reduced, because 5% additional time is available for the mutual exclusion element mutex to settle. If multiple successive clock periods (for example 3) are considered, then the probability that the NI is too slow after 3 clock periods is lower than the probability that the NI is too slow after 1 clock period, because if one delaying event occurs in the 3 clock periods, it has 3×5% slack to settle, instead of just 5%. Similarly, two delaying events during 3 periods each have 1.5×5% slack. For three delaying events, no additional slack is available. This is an advantage of the event-driven synchronization scheme over the data-driven scheme.
Accordingly, the physical (timing and clocking) aspects of networks on chip NOCs are relaxed: no global clock is needed for the network on chip NOC. The networks on chip NOCs scale better in terms of number of components, and hence performance. The IP and network on chip NOC can run at any independent speeds (for event-driven IP-NOC synchronization) without fear of incorrect data but with an a priori known mean time between failures in terms of missed time deadlines.
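The slack figures in this paragraph follow from dividing the total slack of the considered clock periods evenly over the delaying events. A minimal sketch of that arithmetic, with a hypothetical function name and the even split stated as an assumption:

```python
def slack_per_event(periods, events, margin=0.05):
    """Settling slack per delaying event, as a fraction of one clock period.

    Assumes the total slack (periods * margin) is shared evenly by the events.
    """
    if events < 1 or events > periods:
        raise ValueError("expected between 1 and `periods` delaying events")
    return periods * margin / events


for events in (1, 2, 3):
    print(events, f"{slack_per_event(3, events):.3f}")
```

For 3 clock periods and a 5% margin this gives 15% for one event (3×5%), 7.5% for two events (1.5×5%) and 5% for three events, i.e. no additional slack compared with a single period.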
On the other hand, the testing of asynchronous circuits is harder than for synchronous circuits. The standard hardware backend flow (synthesis, timing verification, etc.) is more adapted to synchronous instead of asynchronous designs.
In a network on chip NOC based on the introduced GALS technology according to a fifth embodiment, D-type ports are used at both sides of the channels between NIs and IPs to implement demand-driven communication between the NOC and the IPs. Since all channels use D-type ports, coherent progress of all blocks is guaranteed. Since D-type ports are 100% deterministic, the resulting performance is deterministic as well.
Other methods (from general networks) for providing QoS are known in the literature (in particular, rate-controlled schemes as described by H. Zhang. Service disciplines for guaranteed performance service in packet-switching networks. Proceedings of the IEEE, 83(10):1374-96, October 1995, and deadline-based schemes as described by J. Rexford. Tailoring Router Architectures to Performance Requirements in Cut-Through Networks. PhD thesis, University of Michigan, department of Computer Science and Engineering, 1999), but no networks on chip NOCs have been reported that implement these schemes. These methods also rely on a global notion of synchronicity.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Furthermore, any reference signs in the claims shall not be construed as limiting the scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
05101716.8 | Mar 2005 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB06/50649 | 3/2/2006 | WO | 00 | 8/24/2007 |