This application claims priority from European Patent Application No. 22189135.1, which was filed on Aug. 5, 2022, and is incorporated herein in its entirety by reference.
There are disclosed a semiconductor device and some possible applications. In particular, there are disclosed techniques for interfaces and control flow for a multi-core semiconductor device architecture for the globally asynchronous processing of continuous-time binary-valued signals.
Consider a scalable system constructed from interconnected hardware nodes, each of which may contain analog, digital or mixed-signal circuits. Assume that the hardware nodes run simultaneously, produce globally asynchronous continuous-time binary value (CTBV) output signals and/or receive CTBV input signals, where the absolute timing of the signals (e.g. rising and falling signal edges) conveys information. The transmission delay should be minimal, and must be latency-deterministic for a given configuration of the semiconductor device system, i.e. independent of the processed data and of other transmitted CTBV signals. Assume further that some parameters of the nodes, as well as the connections between them, need to be (partially) reconfigurable at runtime, e.g. to allow a time-multiplexed operation of the system, but this must not cause a loss of data while (parts of) the system reconfigure. Finally, the system should be able to operate in a self-timed mode.
One specific example of this situation occurs in neuromorphic computing, where the nodes contain mixed-signal implementations of (groups of) spiking neurons and their synaptic connections, and the entire system implements an accelerator for spiking neural networks (SNNs).
It is in general difficult to define a scalable system architecture comprising an interconnect system and interfaces that support configurable routing of simultaneous CTBV signals with fixed and deterministic (for a given configuration) transmission delays. Current implementations usually circumvent this problem by instead transmitting the timing information contained in the CTBV signals via dynamically routed packages, which typically contain both a timestamp and additional information required for routing.
It is also difficult to define a control flow and interfaces that support the partial, self-timed and self-controlled run-time reconfiguration of the system, e.g. for the purpose of time-multiplexing, without introducing unnecessary latency and without causing a loss of data during the reconfiguration periods.
Conventional technology is rooted in different research fields: field-programmable analog arrays (FPAAs), field-programmable gate arrays (FPGAs), globally asynchronous locally synchronous (GALS) chip architectures and neuromorphic processors. In none of these fields has a system architecture with a control hierarchy and processing scheme been previously proposed which supports both
FPGAs [1] [2], FPAAs [3] [4] or field-programmable mixed-signal arrays (FPMAs) [5] offer some degree of configurability of the connections between programmable function units (PFUs) or analog circuits (e.g. filters [4]). These classes of devices can realize, within limits, arbitrary functionality by synthesis, i.e. by suitably connecting and configuring these components so that they realize a specified behavior. However, given the limitations imposed by the available components within such a device, the synthesized circuits tend to be substantially slow, poorly power-efficient, and to require considerable area. In particular, these classes of devices typically do not provide any special circuitry to support the above-mentioned runtime reconfigurability of the system, the separation of CTBV signals and configuration data, or the self-timed operation explained later. In the specific context of neuromorphic hardware, the functionality of the constituting parts (neurons and synapses) is typically described in terms of analog circuits that, in general, have no equivalent, efficient FPGA-, FPAA- or FPMA-synthesizable solution.
Globally asynchronous locally synchronous (GALS) architectures are systems where processing nodes can run locally synchronously, but communicate globally asynchronously with each other. Typical implementations of GALS architectures connect multiple locally synchronously operated processing nodes via an asynchronous interconnect system or Network-on-Chip (NoC), that uses some kind of synchronizers at the interfaces to locally synchronously operated circuits. A typical GALS architecture may have an asynchronous wrapper implementing handshake-based communication between processing nodes [6] [7]. Other implementations often use FIFO synchronizers [6] [8] [9].
The handshake-based communication is usually implemented with a request/acknowledge 2-phase or 4-phase protocol [10]. When data is transmitted besides performing the handshake, a bundled-data communication protocol can be applied, where the delays of the handshake signals must be longer than those of the data signals. If timing assumptions are not possible, delay-insensitive data codes can be used, which encode whether a signal is currently valid within the signal itself. Popular codes are the dual-rail code, where one bit is encoded with two wires, and 1-of-N codes, where one wire is used for each value of the data [10].
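As a concrete illustration of these codes, the following sketch (Python, purely illustrative; all names are hypothetical and not taken from this disclosure) models the dual-rail and 1-of-N encodings described above:

```python
# Illustrative model of two delay-insensitive data codes (dual-rail and
# 1-of-N). This is not an implementation from the disclosure, only a
# sketch of the textbook encodings described in the text.

SPACER = (0, 0)  # dual-rail "no data" state inserted between valid words

def dual_rail_encode(bit):
    """Encode one bit on two wires; exactly one wire high marks validity."""
    return (0, 1) if bit == 0 else (1, 0)

def dual_rail_decode(wires):
    """Return the bit if the code word is valid, or None for the spacer."""
    if wires == (0, 1):
        return 0
    if wires == (1, 0):
        return 1
    return None  # spacer or illegal code word (both wires high)

def one_of_n_encode(value, n):
    """Encode a value in 0..n-1 as n wires with exactly one wire high."""
    return tuple(1 if i == value else 0 for i in range(n))

def one_of_n_decode(wires):
    """Decode a 1-of-N word; valid only when exactly one wire is high."""
    return wires.index(1) if sum(wires) == 1 else None
```

Because validity is encoded in the code word itself, a receiver needs no timing assumption: it waits for a valid code word, then for the spacer, before accepting the next word.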
A GALS system has been proposed with the same split between processing nodes and communication nodes, where an asynchronous NoC architecture is described. A package-switched communication is proposed via routers, each with five connections: four to other routers in a grid and one to an associated synchronous processing node. Between routers, data is transferred based on a handshake with a bundled data protocol. Besides the actual payload, the transmitted data also contains information about the path to the target of the data and its priority. At the input port of a router, data is forwarded to one output port based on the path information. While high-priority data is processed at the next possible time, low-priority data is buffered at the output port. The buffered data is transmitted with a first-arrived, first-served prioritization. It is argued that at system level the routing algorithm will ensure that high-priority data never conflicts. This constraint limits to one the number of paths through one router that can simultaneously carry high-priority data. Low-priority data is allowed to conflict and is then delayed. Therefore, the transmission delays in the described system are priority- and traffic-dependent. With priority- and traffic-dependent delays, it becomes very difficult or even impossible to encode information directly in the timing of data. An example GALS system architecture, also describing system control and program execution for performing a spiking convolution, is given in [8]. The system consists of a global controller called Hub and multiple locally synchronous processing nodes called neurons, each connected to the Hub via FIFO synchronizers. The Hub uses the same FIFO to broadcast to all neurons in a time-multiplexed manner, which means that multiple neurons cannot be fed with inputs simultaneously. Therefore, any information directly encoded into the timing of signals generated by the neurons is lost.
Local memories of the neurons are only updated when receiving input spikes and are not used for the configuration of the neuron. All configuration is located in the Hub.
Different neuromorphic processors with different levels of complexity and programmability have been proposed so far. First, existing systems will be categorized according to how they implement synaptic connections between neurons.
The first class uses dedicated signals for all the neurons' in- and outputs, which are arranged into a so-called crossbar structure [12]. Here, the neurons' outputs (as well as external inputs) are connected to the rows of the crossbar, whereas the neurons' inputs are connected to the columns (or vice versa). By appropriately setting programmable switches that each connect one column to one row, arbitrary connections between the neurons' outputs (or external inputs) and the neurons' inputs can be established. This type of interconnect can operate asynchronously, but the number of programmable switches, and thus its area, grows quadratically with the number of neurons in the network. Some variations of this architecture therefore reduce the degrees of freedom by either (partially) randomizing connections or by (partially) enforcing regular connectivity structures. Crucially, these architectures typically sum up multiple incoming signals in the analog domain by connecting multiple input signals to the same output signal. Often, multi-level resistive, memristive or equivalent devices are used instead of binary switches to implement analog in-memory computing. In that case, the interconnect uses analog or mixed signals. As a standalone solution, without interconnections of any other type between the output of a crossbar and its input or the inputs of another crossbar, such an approach does not provide the same level of programmability and flexibility as the invented system architecture.
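The crossbar scheme described above, and its quadratic growth in programmable switches, can be sketched as follows (an illustrative Python model, under the assumption that binary signals on a shared column merge by a logical OR; analog crossbars would sum currents instead):

```python
# Minimal model of a crossbar interconnect: rows carry neuron outputs (or
# external inputs), columns feed neuron inputs, and a programmable switch
# at (row, col) connects them. Names are illustrative, not taken from any
# concrete device in the cited literature.

class Crossbar:
    def __init__(self, n_rows, n_cols):
        # One programmable switch per (row, col) pair: area grows as O(N^2).
        self.switches = [[False] * n_cols for _ in range(n_rows)]

    def connect(self, row, col):
        self.switches[row][col] = True

    def propagate(self, row_signals):
        """Each column sees the OR of all rows connected to it (for binary
        spike signals; analog in-memory crossbars would sum instead)."""
        n_cols = len(self.switches[0])
        return [
            any(row_signals[r] and self.switches[r][c]
                for r in range(len(self.switches)))
            for c in range(n_cols)
        ]

xb = Crossbar(4, 4)
xb.connect(0, 2)   # route output of neuron 0 to input of neuron 2
xb.connect(3, 2)   # a second row on the same column: inputs merge
```

Note that the merging of several rows onto one column is exactly the in-fabric summation mentioned in the text, and the `switches` matrix makes the quadratic area cost explicit.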
The second class of neuromorphic processors uses interconnects that consist of a regular network-on-chip that transmits individual spike events as digital packages. Examples of this are proposed in [13] and patented in [15]. Since the amplitude information of a spike is usually binary and the timing of a spike event should be given implicitly by the time at which the package is generated, the only information to be transmitted is the target address of the package. Based on the target address contained in the package, each router in the NoC forwards the package in the appropriate direction until it reaches its destination. This routing scheme and variations thereof are known as Address Event Representation (AER) and are widely used. Both synchronous and asynchronous versions of this approach are in use. The area consumption for this form of package-switched routing only grows linearly with the number N of neurons (if the bus width is fixed and wider addresses are serialized) or proportionally to N log(N) (if the bus width is chosen wide enough to hold the entire target address in parallel). However, package-switched routing requires that each spike's address is encoded into a multibit digital package, which then has to traverse multiple routers on its way to the target. Each router must decode this address information and must implement arbitration between simultaneously arriving spikes. Since each router is shared between many different signal paths, congestion can occur, which has to be resolved or prevented through appropriate routing strategies and handshaking. This in turn adds substantial complexity and usually traffic-dependent transmission delays, as already argued previously. For example, in a system as proposed in [16], AER as an input protocol does not allow the parallel processing of multiple near-simultaneous input spikes, which makes it impossible to encode information directly in the precise timing of signals.
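As an illustration of how an AER-style package traverses such a NoC, the following sketch models a 2-D router grid with simple dimension-ordered (X-then-Y) forwarding; this is a generic textbook routing strategy, not the routing algorithm of any of the cited chips:

```python
# Sketch of address-event representation (AER) routing on a 2-D router
# grid. Every hop decodes the target address carried in the package and
# forwards it one step closer; direction names are illustrative.

def route_step(router_xy, target_xy):
    """One router's forwarding decision: move in X first, then in Y."""
    x, y = router_xy
    tx, ty = target_xy
    if x != tx:
        return "EAST" if tx > x else "WEST"
    if y != ty:
        return "NORTH" if ty > y else "SOUTH"
    return "LOCAL"  # deliver to the attached core

def route_path(src, dst):
    """Full multi-hop path of a package from src router to dst router."""
    hops, pos = [], src
    while True:
        step = route_step(pos, dst)
        hops.append(step)
        if step == "LOCAL":
            return hops
        dx = {"EAST": 1, "WEST": -1}.get(step, 0)
        dy = {"NORTH": 1, "SOUTH": -1}.get(step, 0)
        pos = (pos[0] + dx, pos[1] + dy)
```

The sketch makes the drawback visible: every spike becomes a multi-hop, address-decoded package, so its delivery delay depends on the path length and on any congestion along it, unlike a dedicated hard-wired connection.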
In [13], timing information of spikes is captured by adding a time-stamp to every spike package, and in [14] a global synchronization of time steps ensures that spikes of one time step are processed together in the next one.
The third class of SNN accelerators combines the former two approaches in some hierarchical structure. On a lower level of the hierarchy, multiple neurons are densely interconnected by some form of a crossbar into cores (also variously referred to as tiles or (neural) processing units). These cores are interconnected through some form of package-switched NoC, potentially via multiple levels of hierarchy. For these systems the same arguments can be applied as for the second class.
Regarding the control flow, both Intel's Loihi [14] [17] and IBM's TrueNorth [13] [15] also use global synchronization for distinct phases of operation. TrueNorth uses global synchronization signals called ticks. Within each tick, each processing core processes the spike messages in its input buffer that it received during previous ticks, may generate new spike messages, and transmits them to other processing cores globally asynchronously. Received spike messages are buffered and not processed before the next tick. A scheduler within each processing core decides whether a spike in its input buffer should be processed within the current tick, based on the time-stamp information attached to each spike message. The synchronization here is determined solely by the global synchronization signal and does not depend on any information from individual processing cores. The time per tick is chosen based on the worst-case processing duration per tick. In TrueNorth, processing is split into ticks to enable a global synchronization of the program execution, which results in a quantization of spike timing to global synchronization steps (1 ms in TrueNorth); spikes generated within one tick or processing step cannot be processed within the same step. This drastically limits the possibility to encode information directly in spike timing.
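The tick-based buffering and scheduling described for TrueNorth can be modeled with a small sketch (illustrative Python; the class name and interface are hypothetical), which shows how spike timing is quantized to ticks and why a spike can never be processed in the tick in which it was generated:

```python
# Toy model of tick-based operation: spikes are buffered together with a
# delivery tick (derived from their time stamp) and only processed when
# the global tick reaches that value. Illustrative only.

class TickScheduler:
    def __init__(self):
        self.buffer = []   # list of (delivery_tick, spike_payload)

    def receive(self, delivery_tick, spike):
        # A spike arriving within a tick waits at least until the next
        # global synchronization step before being processed.
        self.buffer.append((delivery_tick, spike))

    def process_tick(self, tick):
        """Return spikes due at this tick; timing resolution is one tick."""
        due = [s for (t, s) in self.buffer if t <= tick]
        self.buffer = [(t, s) for (t, s) in self.buffer if t > tick]
        return due
```

Any timing difference finer than one tick (1 ms in TrueNorth) is lost in this scheme, which is the quantization argument made above.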
Intel's Loihi has a slightly different approach: it does not synchronize processing steps with a constant global time period, but uses feedback from all processing cores, which carries the information that a processing core has processed all spikes it received in the previous processing step. When all processing cores have finished the last processing step, the system moves on to the next one. This handshaking is implemented using message passing from each core to all its neighbors.
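This barrier-style step advance can be sketched as follows (illustrative Python; a simplified global view, not Loihi's actual neighbor-to-neighbor message passing):

```python
# Sketch of barrier-synchronized step advance: the global step counter
# only increments once every core has reported completion of the
# previous step. Names are illustrative.

def advance_step(current_step, finished_flags):
    """Advance the step only when all cores signalled completion; on
    advance, the completion flags are reset for the next step."""
    if all(finished_flags):
        return current_step + 1, [False] * len(finished_flags)
    return current_step, finished_flags

# Usage: three cores, only two of which have finished so far.
step, done = advance_step(0, [True, False, True])   # barrier holds
step, done = advance_step(step, [True, True, True]) # barrier releases
```

The step duration is therefore data-dependent (it follows the slowest core) rather than fixed, which distinguishes this scheme from TrueNorth's constant-period ticks.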
An embodiment relates to a semiconductor device system including a central controller and a plurality of hardware nodes implemented in application-specific integrated circuit, ASIC, the hardware nodes being mutually interconnected with each other through a plurality of hard-wired connections which support the transmissions of globally asynchronous continuous-time binary value, CTBV, signals, in such a way that each hard-wired connection supports the propagation of one unique CTBV signal from a first transmitting hardware node connected to the hard-wired connection to at least one receiving hardware node connected to the hard-wired connection, so as to define at least one point-to-point(s) communication path between at least two hardware nodes, which are processing nodes, along a sequence of hard-wired connections connected to each other through at least one switching circuitry, wherein the at least one switching circuitry is controlled by at least one hardware node, of the plurality of hardware nodes, which is a communication node, the at least one switching circuitry being configured to selectably connect, based on configuration data, at least two hard-wired connections in the sequence of hard-wired connections, so as to permit the transmission of each CTBV signal along the sequence of hard-wired connections, wherein the at least one switching circuitry is latency-deterministic, wherein each hardware node of the plurality of hardware nodes is configured to download configuration data through a package-switched configuration communication path.
According to another embodiment, a method for a semiconductor device system including a plurality of hardware nodes implemented in application-specific integrated circuit, ASIC, the hardware nodes being mutually interconnected with each other through a plurality of hard-wired connections which support the transmissions of globally asynchronous continuous-time binary value, CTBV, signals, in such a way that each hard-wired connection supports the propagation of one unique CTBV signal from one transmitting hardware node connected to the hard-wired connection to at least one receiving hardware node connected to the hard-wired connection, so as to define at least one point-to-point(s) communication path between at least two hardware nodes, configured as processing nodes, along a sequence of hard-wired connections connected to each other through at least one switching circuitry, may have the steps of: downloading, by the hardware nodes of the plurality of hardware nodes, configuration data through a package-switched configuration communication path; by the hardware nodes configured as processing nodes, processing, transmitting and/or receiving CTBV signals according to the downloaded configuration data; and by at least one hardware node configured as communication node, based on the downloaded configuration data, selectably connecting, by at least one latency-deterministic switching circuitry controlled by the at least one hardware node configured as communication node, at least two hard-wired connections in the sequence of hard-wired connections, thereby permitting the transmission of CTBV signals along the sequence of hard-wired connections.
In accordance to an aspect, there is provided a semiconductor device system comprising a central controller and a plurality of hardware nodes implemented in application-specific integrated circuit, ASIC, the hardware nodes being mutually interconnected with each other through a plurality of hard-wired connections which support the transmissions of globally asynchronous continuous-time binary value, CTBV, signals, in such a way that each hard-wired connection supports the propagation of one unique CTBV signal (i.e. one unique CTBV signal propagates through the hard-wired connection) from a first transmitting hardware node connected to the hard-wired connection to at least one receiving hardware node connected to the hard-wired connection, so as to define at least one point-to-point(s) communication path (e.g. with a deterministic delay, independent of other CTBV signals) between at least two hardware nodes, which are processing nodes, along a sequence of hard-wired connections connected to each other through at least one switching circuitry. The at least one switching circuitry may be controlled by at least one hardware node (e.g. each switching circuitry may be part of a respective hardware node which is a communication node), of the plurality of hardware nodes, which is a communication node. The at least one switching circuitry (e.g. each switching circuitry) may be configured to selectably connect, based on configuration data, at least two hard-wired connections in the sequence of hard-wired connections, so as to permit the transmission of each CTBV signal along the sequence of hard-wired connections (e.g. a sequence of hard-wired connections alternated with switching circuitries may define a point-to-point communication path). The at least one switching circuitry may be, according to an aspect, latency-deterministic.
Each hardware node of the plurality of hardware nodes may be configured to download configuration data through a package-switched configuration communication path.
In accordance to an aspect, there is provided a semiconductor device system comprising a central controller and a plurality of hardware nodes implemented in application-specific integrated circuit, ASIC, the hardware nodes being mutually interconnected with each other and exchanging globally asynchronous continuous-time binary value, CTBV, signals through a plurality of hard-wired connections which support the transmissions of CTBV signals, in which information is encoded, in such a way that each hard-wired connection supports the propagation of one unique CTBV signal from a first transmitting hardware node (e.g. transmitting a CTBV signal in which information is encoded) connected to the hard-wired connection to at least one receiving hardware node (e.g. receiving a CTBV signal in which information is encoded) connected to the hard-wired connection, so as to define at least one point-to-point(s) communication path between at least two hardware nodes, which are processing nodes, along a sequence of hard-wired connections connected to each other through at least one switching circuitry. The at least one switching circuitry is controlled by at least one hardware node, of the plurality of hardware nodes, which is a communication node, the at least one switching circuitry being configured to selectably connect, based on configuration data, at least two hard-wired connections in the sequence of hard-wired connections, so as to permit the transmission of each CTBV signal along the sequence of hard-wired connections, wherein the at least one switching circuitry is latency-deterministic. Each hardware node of the plurality of hardware nodes is configured to download configuration data through a package-switched configuration communication path.
In accordance to an aspect, there is provided a semiconductor device system comprising a central controller and a plurality of hardware nodes implemented in application-specific integrated circuit, ASIC, the hardware nodes being mutually interconnected with each other through a plurality of hard-wired connections which support the transmissions of globally asynchronous continuous-time binary value, CTBV, signals, in such a way that each hard-wired connection supports the propagation of one unique CTBV signal (i.e. one unique CTBV signal propagates through the hard-wired connection) from a first transmitting hardware node (which e.g. encodes information onto the CTBV signal, or which converts the CTBV signal from a non-CTBV signal, or which causes the propagation of the CTBV signal) connected to the hard-wired connection to at least one receiving hardware node (which e.g. decodes information from the CTBV signal, or which converts the CTBV signal onto a non-CTBV signal, or which causes the propagation of the CTBV signal) connected to the hard-wired connection, so as to define at least one point-to-point(s) communication path between at least two hardware nodes, configured as processing nodes (e.g. from a first, transmitting processing node encoding information onto the CTBV signal or converting a non-CTBV signal onto the CTBV signal, to a second, receiving processing node decoding information from the CTBV signal or converting the CTBV signal onto a non-CTBV signal), along a sequence of hard-wired connections connected to each other through at least one switching circuitry. The at least one switching circuitry may be controlled by at least one hardware node (e.g. each switching circuitry may be part of a respective hardware node configured as communication node), of the plurality of hardware nodes, which is configured as communication node. The at least one switching circuitry (e.g. each switching circuitry) may be configured to selectably connect, based on configuration data, at least two hard-wired connections in the sequence of hard-wired connections, so as to permit the transmission of each CTBV signal along the sequence of hard-wired connections (e.g. a sequence of hard-wired connections alternated with switching circuitries may define a point-to-point communication path). The at least one switching circuitry may be, according to an aspect, latency-deterministic. Each hardware node of the plurality of hardware nodes may be configured to download configuration data through a package-switched configuration communication path.
The semiconductor device system may be so that at least one hardware node configured as a communication node includes:
In accordance to an aspect, the at least one switching circuitry is configured to delay the propagation of the CTBV signal based only on the hardware configuration of the at least one switching circuitry, but not on any of the CTBV signal(s) inputted to the switching circuitry.
In accordance to an aspect, the at least one switching circuitry is an asynchronous combinatorial component.
In accordance to an aspect, each hardware node configured as processing node and each hardware node configured as communication node may be configured to be sequentially in one of at least the following phases:
In accordance to an aspect, each hardware node is configured, when in the non-operative phase but ready to enter the operative phase, to provide information signalling readiness to enter the operative phase to the central controller, wherein the central controller is configured to trigger the transmission of the global start command upon reception of the information signalling readiness to enter the operative phase from the totality of hardware nodes configured as processing nodes and from the totality of hardware nodes configured as communication nodes.
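The readiness/global-start handshake of this aspect can be sketched as follows (illustrative Python; all class and method names are hypothetical, not taken from the disclosure):

```python
# Sketch of the readiness collection described above: the central
# controller tracks which nodes (processing and communication nodes
# alike) have signalled readiness, and broadcasts the global start
# command only once the set of pending nodes is empty.

class CentralController:
    def __init__(self, node_ids):
        self.pending = set(node_ids)   # nodes not yet ready
        self.started = False

    def signal_ready(self, node_id):
        """Called when a node in the non-operative phase signals
        readiness to enter the operative phase."""
        self.pending.discard(node_id)
        if not self.pending:
            self.broadcast_global_start()

    def broadcast_global_start(self):
        # In hardware, this would transmit the global start command to
        # every node; here we only record that it happened.
        self.started = True
```

This makes explicit that the start is gated on the totality of nodes: a single node that has not yet signalled readiness holds back the global start command.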
In accordance to an aspect, the semiconductor device is connectable to a further semiconductor device having a second-tier controller and a plurality of further hardware nodes, wherein the central controller is configured, when connected to the further semiconductor device, to transmit the global start command and/or the global stop command also to the further semiconductor device, the central controller being configured to receive, from the second-tier controller, information signalling readiness to enter the operative phase by the further hardware nodes.
In accordance to an aspect, each hardware node is configured, once in the non-operative phase, to download, if present, configuration data onto a local configuration memory of the hardware node, the hardware node being configured to subsequently:
In accordance to an aspect, the non-operative phase includes, as subphases, a configuration download phase and at least one transition phase,
In accordance to an aspect, the local configuration memory includes multiple enumerated memory segments enumerated according to a predetermined sequence, wherein each hardware node configured as processing node and each hardware node configured as communication node is configured, during the same non-operative phase, to download multiple enumerated configuration data to be written in the multiple enumerated memory segments according to the predetermined sequence,
In accordance to an aspect, each hardware node configured as processing node and each hardware node configured as communication node is configured, in case no configuration data are downloaded in a current non-operative phase, to select between:
In accordance with any of the preceding aspects, at least one processing node is configured or configurable as input/output, I/O, node as a particular case of a processing node, so as to be configured to:
In accordance to an aspect, the system may be configured for implementing a spiking neural network, SNN, the SNN comprising a plurality of neurons, at least one neuron of the plurality of neurons being configured to have runtime configurable parameter(s) and being configured to output at least one CTBV signal processed as determined by the runtime configurable parameter(s), the SNN including a plurality of synapses between the neurons, each synapse of the plurality of synapses being configured to provide an input signal to a neuron, the signal being provided by an output signal of the same or another neuron,
According to an aspect, there is provided a configuration device for controlling the semiconductor device system of any of the preceding aspects, configured to
According to an aspect, the configuration device is configured to control the semiconductor device system of a previous aspect, further configured to assign, to each individual neuron or plurality of neurons, one hardware node configured as processing node, and, to each synapse, one point-to-point(s) communication path, the configuration device being configured to provide, to each hardware node configured as processing node, runtime configurable parameter(s) as part of the configuration data, and, to each hardware node configured as communication node, configuration data to switch the latency-deterministic circuitry to perform connections between hard-wired connections so as to instantiate a point-to-point(s) communication path instantiating a synapse.
In accordance to an aspect, the semiconductor device is configured to perform at least one training session, during which different output signals are examined for different input signals and different runtime configurable parameter(s), the configuration device being configured to evaluate the input and output signals and the runtime configurable parameter(s) according to a given cost function and to optimize the runtime configurable parameters so as to minimize the cost function.
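The training session of this aspect can be sketched as a simple search over candidate runtime-configurable parameters (illustrative Python; the cost function, the candidate set, and the way the system is run are placeholders, not the optimization method of the disclosure):

```python
# Sketch of a training session: each candidate parameter set is applied,
# the system's outputs for the given inputs are recorded, the cost
# function is evaluated, and the parameters minimizing the cost are kept.

def train(candidate_params, run_system, cost_fn):
    """Return the runtime-configurable parameters minimizing the cost."""
    best, best_cost = None, float("inf")
    for params in candidate_params:
        outputs = run_system(params)      # one run with these parameters
        cost = cost_fn(params, outputs)   # evaluate the given cost function
        if cost < best_cost:
            best, best_cost = params, cost
    return best, best_cost
```

Here an exhaustive search stands in for whatever optimizer is actually used; the point is only the loop of configure, run, evaluate, and select.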
In accordance to an aspect, there is provided a method for a semiconductor device system comprising a plurality of hardware nodes implemented in application-specific integrated circuit, ASIC, the hardware nodes being mutually interconnected with each other through a plurality of hard-wired connections which support the transmissions of globally asynchronous continuous-time binary value, CTBV, signals, in such a way that each hard-wired connection supports the propagation of one unique CTBV signal from one transmitting hardware node connected to the hard-wired connection to at least one receiving hardware node connected to the hard-wired connection, so as to define at least one point-to-point(s) communication path between at least two hardware nodes, configured as processing nodes, along a sequence of hard-wired connections connected to each other through at least one switching circuitry,
In accordance to an aspect, at least one processing node may be configured to transmit, or receive, the CTBV signal as a signal non-synchronized to any clock signal.
In accordance to an aspect, at least one switching circuitry may include an asynchronous combinatorial circuit which does not rely on a clock signal.
In accordance to an aspect, at least one processing node may be configured to decode information from the timing of a CTBV signal only, and/or configured to encode information onto the timing of the CTBV signal only.
In accordance to an aspect, the device may be configured to realize the CTBV signals as physical propagations of electric signals, so that information is encoded in the timing of the CTBV signals.
In accordance to an aspect, the at least one point-to-point(s) communication path may be queueless, so as to cause the CTBV signals to propagate without delays due to simultaneously propagating CTBV signals.
In accordance to an aspect, the at least one switching circuitry may be an arbitrationless circuitry, so that there is no competition, between different CTBV signals simultaneously propagating through the switching circuitry, to gain access to a same resource and/or to be propagated first.
In accordance to an aspect, the at least one point-to-point(s) communication path may avoid any electric contact with any other point-to-point(s) communication path.
In accordance to an aspect, there is provided a method for controlling the semiconductor device system of a previous aspect, comprising:
In accordance to an aspect, there is provided a non-transitory storage unit storing instructions which, when executed by a processor, cause the processor to perform the method of a previous aspect.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
Different hard-wired connections may therefore simultaneously propagate different CTBV signals through different, electrically disconnected paths.
Each pin (e.g. input/output pin) of the input/output port of each hardware node may therefore be connected to one single hard-wired connection (e.g. 2.4, 3.4, 4.4) to support the transmission of one single CTBV signal, while the input/output port can be connected to a plurality of hard-wired connections (e.g. 2.4, 3.4, 4.4), so as to support the transmission of a plurality of CTBV signals (one for each hard-wired connection). (The plurality of hard-wired connections may be understood as electrically disconnected from each other during the physical propagation of the different CTBV signals.) The propagation of the CTBV signals along the hard-wired connections is deterministic, which makes it possible to guarantee the absence of unwanted, unexpected delays. Physically, the propagation of the CTBV signals is not delayed by any other CTBV signal propagating simultaneously. In particular, the hard-wired connections are in general not electrically in contact and/or physically interfering with each other. In general, each hard-wired connection (e.g. 2.4, 3.4, 4.4) may fixedly connect (e.g. electrically) two different hardware nodes (or, if a hard-wired connection connects more than two hardware nodes with each other, then there is defined one single transmitting hardware node, and the remaining hardware nodes are receiving hardware nodes). Therefore, each hard-wired connection (e.g. 2.4, 3.4, 4.4) represents a node-to-node connection between one transmitting and at least one receiving node (e.g. in proximity to each other). Each hard-wired connection may be, for example, an electrically conductive line, or an electrically conductive via, or a sequence of electrically conductive line(s) and/or conductive via(s), substantially of conductive nature, e.g. with known resistivity. Each hard-wired connection is connected, at one side, to one single pin of a first hardware node and, at the other side, to at least one pin of at least one second hardware node.
Each hard-wired connection (e.g. 2.4, 3.4, 4.4) may be understood as not passing through non-deterministic elements, such as queues: each hard-wired connection (e.g. 2.4, 3.4, 4.4) may be understood as queueless, because no CTBV signal shall wait for any operation simultaneously performed on any other CTBV signal.
It will be shown that, in supporting the transmission of each CTBV signal, there is in general defined a point-to-point(s) communication path, formed by a sequence of hard-wired connections (e.g. 2.4, 3.4, 4.4) and other latency-deterministic circuitry, such as selectable switching circuitry (e.g. 12′), which may be, for example, controlled by hardware nodes which are communication nodes 4. In general terms, the notation 2.4, 3.4, 4.4 simply distinguishes the hard-wired connections based only on the type of hardware nodes (2, 3, or 4) which are mutually connected; the hard-wired connection(s) 2.4 are not necessarily physically distinguished from the hard-wired connection(s) 3.4 and the hard-wired connection(s) 4.4.
Each hardware node may operate according to a particular task. In the following, the following configurations are distinguished (the distinction may be implemented in hardware and/or may be unmodifiable, in some examples):
Therefore, information may be encoded in the CTBV signals and may be passed from a processing node (primary node or I/O node) to another processing node (primary node or I/O node). An I/O node may therefore convert a CTBV signal from and/or onto one or more non-CTBV signals (which may be synchronized signals, e.g. synchronized to an internal clock signal, or another kind of signal), so as to provide the non-CTBV signal to another system and/or to convert the non-CTBV signal onto the CTBV signal. A primary node may decode information from and/or encode information onto each CTBV signal, and encode information after having performed processing. The communication nodes may support the transmissions of the CTBV signals without decoding and/or encoding them, while ensuring that the point-to-point(s) communication paths do not electrically intersect with each other or, more generally, that each CTBV signal is not queued or stalled because of another simultaneously propagating CTBV signal.
It will be shown that at least one hardware node (e.g. a plurality of hardware nodes) may:
In some examples, there are only hardware nodes uniquely configured to perform specific tasks (e.g., a processing node natively uniquely devoted to operate as a processing node, and/or a communication node natively uniquely devoted to operate as a communication node). In some other examples, there are only hardware nodes selectively configurable to perform specific tasks (e.g., a hardware node operating, according to a first configuration, as a processing node, and, according to a second configuration, as a communication node, and maybe, in a third configuration, as both processing node and communication node). In other examples, at least one hardware node is uniquely configured to perform specific tasks, while at least one other hardware node is selectively configurable to perform a specific task. A hardware node may be configured (e.g. uniquely configured, or configurable) as being both a processing node and a communication node (e.g., in different terminals).
At least one hardware node of the plurality of hardware nodes may be selectably configured between:
It is also possible that a node acts both as a processing node and as a communication node, since, while processing, transmitting and/or receiving at least one CTBV signal, the hardware node may also permit a latency-deterministic transmission of another signal.
In general terms, between two arbitrary processing nodes there may be defined a point-to-point(s) communication path. Each point-to-point(s) communication path is generally constituted by a sequence of latency-deterministic hard-wired connections connected to each other through the switching circuitries of communication nodes. Accordingly, each point-to-point(s) communication path is also latency-deterministic, by virtue of being constituted only by latency-deterministic elements. It will also be shown that, while the latency-deterministic hard-wired connections are in general fixed and cannot be changed, the point-to-point(s) communication path can generally be varied by appropriately selecting communication nodes, alternated with latency-deterministic hard-wired connections, to form a chain between two arbitrary processing nodes. Therefore, for the particular application to be carried out (e.g. an SNN, see below), it is guaranteed that the delays caused by the hard-wired connections and the switching circuitries are determined only by their hardware implementation and configuration, and not by the contents of the data being processed.
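Merely by way of illustration, the latency-deterministic chaining described above may be sketched as follows (a non-limiting Python sketch; all names and numeric values are hypothetical and not part of the present disclosure): the end-to-end delay of a point-to-point(s) communication path is simply the sum of the fixed, configuration-determined delays of its elements, independent of the transmitted CTBV signal.

```python
# Non-limiting sketch (hypothetical element names and delay values):
# the end-to-end delay of a point-to-point(s) communication path is the
# sum of the fixed delays of its chained elements, and is therefore
# independent of the data carried by the CTBV signal.

def path_latency_ps(elements):
    """elements: list of (name, fixed_delay_ps) pairs forming the chain."""
    return sum(delay for _name, delay in elements)

# A chain: hard-wired connection -> switching circuitry -> hard-wired connection.
path = [("wire_2_4", 800), ("switch_12p", 1500), ("wire_4_3", 600)]

latency = path_latency_ps(path)  # 2900 ps, identical for every signal edge
```

For a given configuration, every rising or falling edge sent over this path experiences the same 2900 ps delay, which is the latency-deterministic property.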
The switching circuitry 12′ may, therefore, be latency-deterministic. The switching circuitry 12′ can delay the propagation of the CTBV signal based on the hardware configuration of the switching circuitry (12′), but not on any of the CTBV signal(s) inputted to the switching circuitry (12′).
This may be achieved, for example, by implementing the switching circuitry 12′ as an asynchronous combinatorial circuit. In particular, the asynchronous property, which does not rely on a clock signal, makes it possible to avoid quantization latencies, which would be undetermined. The combinatorial property, which causes immediate responses to the inputs, is also appropriate for avoiding undetermined delays.
In addition or alternatively, the switching circuitry 12′ may be an electrically passive element, in particular constituted by passive elements (e.g. dipolar, i.e. two-terminal, elements such as resistors, conductors, capacitors, inductors, etc.).
(Even if the switching circuitry 12′ may cause some delays in the physical propagation of the CTBV signal(s), these delays are nevertheless physically deterministic, and the delay of a first CTBV signal is not caused by any other simultaneously propagating CTBV signal.)
At least one switch may be a programmable switch (e.g. memristive devices, floating gate transistors, etc.), and may be programmed by the communication node 4 (e.g. by the configuration memory 420) following the configuration data. In addition or alternatively, at least one switch may be a dynamic switch, and may be dynamically controlled by the communication node 4 (e.g. by the configuration memory 420), which is outside the switching circuitry 12′. In the operative phase 23 (see below) the configuration data nevertheless control the dynamic switch so as to maintain a latency-deterministic behavior, thereby avoiding switching at undetermined time instants. In particular, it may be provided that, during the same operative phase 23, no switching is actuated (the switching may instead be actuated during a different phase, e.g. the transition phase 22, see below). In examples, therefore, the electric configuration of the point-to-point(s) communication paths does not change within the same operative phase 23, but can only change during a different phase (e.g. the transition phase 22). In examples, for each operative phase 23, there may be one single CTBV signal propagating through one single point-to-point(s) communication path.
By being combinatorial, the switching circuitry 12′ may therefore be considered a queueless circuitry: there is no queue between different CTBV signals simultaneously propagating through the same switching circuitry 12′. By the same token, the switching circuitry 12′ may be considered an arbitrationless circuitry: there is no competition, between different CTBV signals simultaneously propagating through the same switching circuitry 12′, to gain access to a same resource and/or to be propagated first.
The switching circuitry 12′ may also be defined so as to avoid any combination between different CTBV signals simultaneously propagating through the same switching circuitry 12′: therefore, parallel resources are provided for different CTBV signals propagating simultaneously (e.g. during the same operative phase 23).
By being latency-deterministic, the switching circuitry 12′ may be analog, without digitizing the CTBV signal. The CTBV signal, therefore, merely propagates physically, and its information is encoded in its timing.
It is shown that a processing node is considered to have a processing core 12 (e.g., defined by configuration data), while a communication node is considered to have a switching circuitry 12′ (e.g., a switch box) which plays the same role as the processing core 12, but which has the property of permitting the propagation, through the switching circuitry terminals, of the CTBV signals transmitted by the processing nodes.
In the following discussion, reference is mostly made to “processing nodes” (“I/O nodes”, “primary nodes”) and “communication nodes” without distinguishing whether they are natively dedicated to their task or whether they are configurable hardware nodes which have been configured as processing nodes (I/O nodes, primary nodes) or as communication nodes only by virtue of the downloaded configuration data. Hence, most of the following description refers to any of those hardware nodes, without distinction (unless stated to the contrary).
The semiconductor device system 100 may comprise a central controller 1. The central controller 1 may be natively and fixedly part of the semiconductor device system 100, obtained by adequately doping different layers of semiconductors in different regions, e.g. as circuits integrated in the same chip, together with the hardware nodes 2, 3, 4. The semiconductor device system 100 may comprise a host 200, or an external host 200 may be connected to the semiconductor device system 100. In case the host is part of the semiconductor device system 100, it may be natively and fixedly part of the semiconductor device system 100, obtained by adequately doping different layers of semiconductors in different regions, e.g. as circuits integrated in the same chip, together with the hardware nodes and/or the central controller 1. Otherwise, the host 200 can be an external device connected through connection port(s). The central controller 1 and the host 200 may be, in some examples, the same hardware component, but in other examples they may be different hardware components. In general terms, the central controller 1 may operate as a master device and may provide synchronism (e.g. through a “go_ext” command 11) among the hardware nodes, while the host 200 may operate as a configuration device which configures the hardware nodes. It will be shown that the central controller 1, the host 200, and the hardware nodes may operate in a time-multiplexed fashion, according to which operative phases (during which CTBV signals are exchanged, e.g. by physical propagation) alternate with non-operative phases (during which hardware nodes reconfigure and/or download the configuration data necessary for the reconfigurations, the physical propagation of the CTBV signals being, for example, inhibited).
The central controller 1 may have the role of commanding the passages between the operative phase and the non-operative phase (and vice versa), and/or the host 200 may have the role of commanding the passages into (and from) a configuration download phase (during which configuration data, if present, are downloaded) and into at least one transition phase, during which the hardware nodes reconfigure and/or re-initialize. The central controller 1 and the host 200, when not being the same component, may therefore share synchronization information. For example, a “go_ext” command 11 may be sent from the host 200 to the central controller 1, so that, in some use modes, the central controller 1 is configured to send the “go” command 7 (see below) to the plurality of hardware nodes triggered by the “go_ext” command 11, instead of being triggered by the reception of the notifications of readiness to enter the operative phase (see below). The “go_ext” command 11 may use a dedicated conductive line.
The interconnect circuitry 8 (formed by both the hard-wired connections and the switching circuitries 12′, and therefore also the point-to-point(s) communication paths) is a circuit-switched implementation. Therefore, it is possible to ensure that transmission paths (point-to-point(s) communication paths) never overlap. This makes it possible to have deterministic and repeatable delays through the different point-to-point(s) communication paths, and thus to retain a high precision in the timing of the transmitted CTBV signals. Since the CTBV signals can encode different information with different (relative) timings, the fact that they are subjected to deterministic delays makes it possible to increase the information content and the determinism of the communication, since the information encoded in the CTBV signals is subject to the same delays in different repetitions of the transmissions (e.g. in the same point-to-point(s) communication path).
In contrast to packet-switched network-on-chip architectures, the circuit-switched interconnect circuitry also allows the point-to-point(s) transmission of individual bits (e.g., represented by high/low levels of physical values or by rising or falling edges in the physical values) without the need to transmit additional information, such as the address bits needed for dynamic routing of the signal. Since the point-to-point(s) communication paths do not overlap, no queues or arbitration circuits are needed within each point-to-point(s) transmission path. Notably, packet-switched communications are present in the semiconductor device system 100, but they are carried out not for the CTBV transmissions, but for the download of the configuration data, i.e. for non-time-critical operations.
The CTBV interconnection circuitry 8 (formed by both the hard-wired connections and the switching circuitries 12′) gives the possibility of changing from a first point-to-point(s) communication path to a second point-to-point(s) communication path. The semiconductor device system 100 may therefore comprise at least one switching circuitry 12′, which can deviate or otherwise switch from the first point-to-point(s) communication path to a second point-to-point(s) communication path. This switching circuitry 12′ may include a plurality of switching circuitry input terminals (or, more generally, switching circuitry first terminals) and a plurality of switching circuitry output terminals (or, more generally, switching circuitry second terminals): each switching circuitry input terminal (first terminal) can be selectably switched to be connected to one or more switching circuitry output terminals (second terminals) of the same switching circuitry. For example, a first switching circuitry input terminal (e.g. fixedly connected to a first point-to-point(s) communication path) may be selectably connected to a first switching circuitry output terminal (also connected to the same point-to-point(s) communication path) of the same switching circuitry 12′ for a first operative phase 23 and, after a different selection required by different configuration data, the first switching circuitry input terminal may be deviated to be connected to a second switching circuitry output terminal, to form a point-to-point(s) communication path different from the previous one. By selectively switching between different switching circuitry input (first) terminals and different switching circuitry output (second) terminals in the same switching circuitry 12′, it is possible to selectively vary the different point-to-point(s) communication paths according to the necessities. For example:
In general terms, it is during the operative phases 23 that the point-to-point(s) communication paths do not overlap, while during other phases (e.g. transition phases 22, configuration phase 21) the propagations of the CTBV signals may be inhibited. During the non-operative phases (e.g. transition phases 22), the switching circuitries 12′ of the communication nodes 4 may be reconfigured (e.g. by switching), thereby generating different electric paths, which will permit the propagation, during the operative phase 23, of another CTBV signal.
The switching circuitry 12′ of at least one communication node 4 can be embodied by a switch box. Each switching circuitry 12′ (e.g., switch box) may include active circuit elements and/or passive circuit elements and/or combinatorial logic, provided that the CTBV signals, while propagating through the switching circuitry 12′, are only subjected to deterministic delays, thereby obtaining the latency-deterministic property. A switching circuitry 12′ may include, for example, passive circuit elements. A switching circuitry 12′ may include at least one active circuit element, e.g. relaying a received CTBV signal. A switching circuitry 12′ may include combinatorial logic. In general terms, a switching circuitry 12′ is constructed in such a way that it causes a deterministic delay between each input terminal and each output terminal.
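Merely by way of illustration, the behavior of such a switch box may be sketched as follows (a non-limiting Python sketch; the class and method names are hypothetical and not part of the present disclosure): each input terminal is mapped to one or more output terminals, output terminals are never shared between paths (so that paths stay electrically disjoint), and the mapping is frozen during the operative phase 23.

```python
# Non-limiting sketch (hypothetical names): a switch box maps input
# terminals to output terminals. The mapping may only change outside the
# operative phase, so paths remain fixed (and latency-deterministic)
# while CTBV signals propagate through them.

class SwitchBox:
    def __init__(self, n_in, n_out):
        self.n_in, self.n_out = n_in, n_out
        self.map = {}            # input terminal -> set of output terminals
        self.operative = False   # True during the operative phase 23

    def connect(self, i, outs):
        assert not self.operative, "no switching during the operative phase"
        assert i < self.n_in and all(o < self.n_out for o in outs)
        # Paths stay disjoint: each output is driven by at most one input.
        for used in self.map.values():
            assert used.isdisjoint(outs), "paths would overlap"
        self.map[i] = set(outs)

    def propagate(self, i, edge):
        # Combinatorial behavior: the edge appears on every connected output.
        return {o: edge for o in self.map.get(i, ())}

sb = SwitchBox(4, 4)
sb.connect(0, [2])       # first path: input 0 -> output 2
sb.connect(1, [0, 3])    # second path: input 1 fans out to outputs 0 and 3
sb.operative = True      # configuration frozen for the operative phase 23
```

In this sketch, two CTBV signals entering on inputs 0 and 1 traverse the box simultaneously over disjoint outputs, with no queueing or arbitration between them.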
According to one strategy to obtain the latency-deterministic property, each selected path between an input node to at least one output node in a switching circuitry controlled by a communication node transmits a rising or falling edge of a CTBV signal with a delay determined by the hardware implementation and current configuration of the switching circuitry, not the data (including the CTBV signal itself and any other CTBV signal passing through the communication node).
The conductors of the switching circuitry 12′ and, more generally, the conductors of the communication paths may be shielded (e.g., electrically shielded) to avoid negative effects of parasitic capacitances and other interferences.
The switching circuitry 12′ may be controlled by a respective communication node 4. As shown in
However, in this case a processing node can be reprogrammed to become a communication node and/or vice versa (e.g., through different configuration data, as will be explained later).
As discussed above, there is a distinction between
It is reminded that:
Therefore, all positive or respectively all negative signal transitions in each CTBV signal are transmitted with latencies that are determined only by the implementation and configuration of the hardware nodes and hard-wired connections, but that do not depend on any of the CTBV signals being processed.
Later, it will be explained (e.g., with reference to
The “go” signal 7 may therefore cause each hardware node 2, 3, 4 to move from the non-operative phase (20, 21, 22, 22a, 22b) to the operative phase 23, and a global “stop” command (which may have a logical value which is the inverse of the “go” signal 7) may cause each hardware node 2, 3, 4 to move from the operative phase 23 to the non-operative phase (20, 21, 22, 22a, 22b). (The stop command and the go command may be transmitted via the same physical signal, but using inverse values.)
Basically, the global start command 7 may be embodied by a first value (e.g. a high value) meaning that each hardware node 2, 3, 4 shall enter the operative phase 23 from the non-operative phase (20, 21, 22, 22a, 22b), and the global stop command may be embodied by a second value (e.g., a low value) meaning that each hardware node 2, 3, 4 shall enter the non-operative phase (20, 21, 22, 22a, 22b) from the operative phase 23. However, the global start command and the global stop command can be implemented in another way. Here below, reference is often made to the convention that the global start command is indicated with “go”=1 and the global stop command with “go”=0. Basically, the information distinguishing between the global stop command and the global start command may be encoded in one single bit, provided simultaneously to all the hardware nodes 2, 3, 4.
More generally, however, the plurality of hardware nodes 2, 3, 4 may operate synchronously between the operative phase 23 and the non-operative phase (20, 21, 22, 22a, 22b). It is during the operative phase 23 that the hardware nodes 2, 3, 4 exchange CTBV signals, and it is during the non-operative phase that the hardware nodes 2, 3, 4, instead of exchanging CTBV signals, will e.g. either reconfigure, or download configuration data (and subsequently reconfigure), or reinitialize, or (after having performed these operations or at least one of them) wait (352) for the immediately subsequent operative phase 23 to be triggered.
Each hardware node (whether a processing node, such as a primary node 2 or an I/O node 3, or a communication node 4) may also transmit a notification 5 signalling readiness to enter the operative phase (also called the “ready” signal 5). The “ready” signal 5 may be transmitted (e.g., in a dedicated point-to-point communication path) to indicate that the particular hardware node is ready to enter the operative phase 23 after having performed the operations of one of the transition phases 22, 22a, 22b, which will be explained below. The global start command 7 (“go”) may be triggered, at the central controller 1, only once the notification 5 signalling readiness to enter the operative phase 23 has been sent by the totality of the hardware nodes 2, 3, 4.
The central controller 1 may be the device which transmits the “go” signal 7 and/or receives the “ready” signal(s) 5. The central controller 1 may operate as a master device synchronizing the plurality of hardware nodes. In some examples, the central controller 1 may be embodied by one hardware node of the plurality of hardware nodes, while in other examples, the central controller 1 can be embodied by a distinct processing unit, which may be different from any of the hardware nodes 2, 3, 4. In some examples, the central controller may be superseded, e.g. in a particular operation mode (“step-mode”, see also below), by the host 200, which therefore becomes the device which transmits the “go” signal 7 and/or receives the “ready” signal 5. Or, in some examples, the host 200 operates as a master, e.g. by providing the “go_ext” command 11, which commands the (e.g. unconditioned) transmission of the “go” signal 7 by the central controller 1, e.g. without receiving any “ready” signal 5 from the plurality of hardware nodes 2, 3, 4.
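Merely by way of illustration, the “ready”/“go” handshake described above may be sketched as follows (a non-limiting Python sketch; all names are hypothetical and not part of the present disclosure): the central controller broadcasts the “go” signal 7 only once every hardware node has asserted its “ready” signal 5, unless a “go_ext” command 11 from the host forces the unconditioned transmission of “go”.

```python
# Non-limiting sketch (hypothetical names) of the "ready"/"go" barrier:
# the controller issues the global start command 7 only when all nodes
# have signalled readiness 5, unless "go_ext" 11 overrides the barrier.

def controller_step(ready_flags, go_ext=False):
    """ready_flags: mapping node_id -> bool ("ready" signal 5 received).
    Returns True when the "go" signal 7 is to be broadcast."""
    if go_ext:                        # host acts as master (e.g. "step-mode")
        return True
    return all(ready_flags.values())  # barrier over all hardware nodes

ready = {"node2": True, "node3": False, "node4": True}
controller_step(ready)               # False: node3 is not yet ready
ready["node3"] = True
controller_step(ready)               # True: broadcast "go" to all nodes
```

The barrier guarantees that no node enters the operative phase 23 while another node is still reconfiguring, which is what prevents data loss during reconfiguration.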
As explained above, each hardware node(s) 2, 3, 4 may be configured so that different configuration data (e.g. different codes and/or different data) may be downloaded in the hardware node 2, 3, 4, in different non-operative phases to differently condition the operations of the hardware node 2, 3, 4 in an immediately subsequent operative phase 23.
Each hardware node 2, 3, 4, when entering the non-operative phase, may download configuration data (through the packet-switched configuration communication path 9) and write the downloaded configuration data into a local configuration memory 420. It will be shown that, in some examples, no configuration data may be downloaded by a particular hardware node while, simultaneously, another hardware node receives configuration data. During the download of the configuration data, a hardware node 2, 3, 4 may download different configuration data adapted to different operative phases, and store the different configuration data in different memory segments, using the different configuration data in subsequent reconfigurations, without performing new downloads. In yet other cases, the same configuration will be reused, and only a re-initialization will be performed. It will be subsequently shown that in some examples the non-operative phase may be divided among:
In some examples, it is possible to distinguish between different transition phases, i.e. at least one of:
Alternatively, it is possible to consider that the different transition phases are one single transition phase, in which different behaviors are taken in different cases. A discussion is provided below.
At the end of the at least one transition phase, each hardware node 2, 3, 4 may transmit (to the central controller 1 and/or the host 200) the “ready” signal 5, meaning that it is now ready to enter the operative phase 23. Meanwhile, the hardware node 2, 3, 4 remains waiting for the reception of the “go” signal 7 in a state of reduced power consumption.
A more detailed discussion is provided here below. It is noted that the subdivision between the configuration download phase and the at least one transition phase can be skipped in some examples, but is here notwithstanding provided for clarity and completeness.
Each hardware node 2, 3, 4 may enter the configuration download phase 21 (which may be a subphase of the non-operative phase). During the configuration download phase 21 (or, more generally, when downloading the configuration data) the hardware node, instead of transmitting and receiving CTBV signals, may either receive configuration data or nevertheless be ready to receive them. The transmission of the configuration data may be carried out through a configuration communication circuitry 9 (e.g. a configuration bus system) different from the CTBV interconnection circuitry 8. The configuration communication circuitry 9 may comprise, for example, a serial configuration bus 9. The communication circuitry 9 can therefore be serial and may be such that it does not perform CTBV communications or, more generally, may carry out packet-switched communication. The communication circuitry 9 may transmit data through a synchronized paradigm, such as a serial communication, or a parallel communication (e.g., with the transmission of a synchronized signal). In the configuration download phase, therefore, the configuration data may be downloaded by at least one hardware node (e.g. one of 2, 3, 4). The configuration data may include code to be executed and may define the behavior of a processing core 12 for the hardware node. The processing core 12 may be implemented, e.g., in the processing nodes 2 and/or 3, from the configuration data. In the communication node 4, the configuration data downloaded by the hardware node (e.g. from the host 200) may command which switches are to be switched within the switching circuitry 12′ (e.g., how to connect the switching circuitry output (or second) terminals with the switching circuitry input (or first) terminals).
It is to be noted that, in particular, the use of a serial bus for embodying the communication circuitry 9 (or, more generally, of the packet-switched configuration communication path) makes it possible to greatly reduce the hardware requirements of the different hardware nodes 2, 3, 4, e.g. in terms of resources such as wiring. Therefore, the configuration communication circuitry 9 may follow a packet-switched paradigm.
The different configuration data may also change the operations of a hardware node 2, 3, 4, which can turn from a processing node into a communication node and/or vice versa. Basically, there may be a plurality of hardware nodes which may be reconfigured so that each, during the operative phase 23, becomes a processing node (or more in general a primary node or an I/O node) or a communication node, according to the necessities.
In general terms, for at least one (e.g. all) of the hardware node(s) 2, 3, 4, configuration data may be downloaded onto a local configuration memory 420 (see also below) of the at least one hardware node. Each hardware node may be configured to enter the configuration download phase 21 triggered by the reception of a configuration command 6 (e.g., received from the host 200). Each hardware node may be configured to enter a transition phase (22) (e.g. downloaded reconfiguration transition phase 22c) at the end of the configuration download phase (21). Both the configuration download phase (21) and the transition phase (22) are subphases of the non-operative phase.
As explained above, the configuration data may be downloaded through the packet-switched configuration communication path 9 (e.g. serial bus), which may be common to more than one hardware node of the plurality of hardware nodes, each configuration data item being directed to a specific hardware node (e.g., through an address uniquely assigned to each hardware node).
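Merely by way of illustration, the addressed delivery of configuration data over the shared path 9 may be sketched as follows (a non-limiting Python sketch; the packet layout, addresses and payloads are hypothetical and not part of the present disclosure): each packet carries a node address, and each hardware node keeps only the packets bearing its own uniquely assigned address.

```python
# Non-limiting sketch (hypothetical packet layout): configuration data
# travel on the shared packet-switched path 9 as (address, payload)
# packets; each node filters the traffic by its own unique address.

def deliver(packets, node_address):
    """Return the configuration payloads addressed to one node."""
    return [payload for addr, payload in packets if addr == node_address]

bus_traffic = [
    (0x02, b"core-code"),    # e.g. for a processing node
    (0x04, b"switch-map"),   # e.g. for a communication node
    (0x02, b"parameters"),
]

deliver(bus_traffic, 0x02)   # two payloads for this node
deliver(bus_traffic, 0x03)   # empty: this node downloads nothing
```

A node receiving no packets in a given configuration download phase 21 simply waits (or reuses previously downloaded data), as described above.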
As can be seen, the configuration download phase 21 may be initiated from the processing phase 23 when the “go” signal goes to 0 (“global stop command”), under the condition of the “set” signal being 0 (configuration download command), while the passage from the configuration download phase 21 to the downloaded reconfiguration transition phase 22 may be triggered by the reception of the “set”=1 (transition command 6).
Among the plurality of hardware nodes 2, 3, 4, it is possible that not all of them are subjected to the download of configuration data. Accordingly, these hardware nodes may either skip the configuration download phase 21 or remain waiting (without downloading any configuration data) during the configuration download phase (while other hardware nodes download their own configuration data), thereby saving power. In these examples, the hardware nodes may simply perform a reconfiguration based on configuration data previously downloaded in a previous configuration download phase 21 (in this case, it is possible for the hardware node to enter a “previously-downloaded reconfiguration transition phase 22a”). It will be explained that, for example, the local configuration memory 420 of at least one of the hardware nodes may be segmented according to a plurality of memory segments, enumerated according to a predetermined sequence (so that subsequent reconfigurations are performed using the subsequent memory segments). It is possible that different processing phases necessitate different configurations that are nevertheless stored in one single configuration download phase 21 (e.g. in different memory segments of the same configuration memory 420). Accordingly, it is possible to reduce the number of downloads for each hardware node, thereby reducing the power consumption and increasing the speed of the downloads.
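Merely by way of illustration, the segmented configuration memory 420 may be sketched as follows (a non-limiting Python sketch; the class name and segment contents are hypothetical and not part of the present disclosure): several configurations are stored in one configuration download phase 21 and then consumed, one segment per reconfiguration, in the predetermined enumeration sequence, without new downloads.

```python
# Non-limiting sketch (hypothetical names): a segmented configuration
# memory holding several configurations downloaded at once; subsequent
# reconfigurations consume the segments in their predetermined sequence.

class SegmentedConfigMemory:
    def __init__(self, segments):
        self.segments = list(segments)  # configurations, in enumeration order
        self.index = 0                  # next segment to be used

    def next_configuration(self):
        cfg = self.segments[self.index]
        self.index = (self.index + 1) % len(self.segments)  # advance, wrapping
        return cfg

mem = SegmentedConfigMemory(["cfg-A", "cfg-B", "cfg-C"])
mem.next_configuration()   # "cfg-A" for the first operative phase
mem.next_configuration()   # "cfg-B" for the next, with no new download
```

Each call corresponds to one previously-downloaded reconfiguration transition phase 22a: the node reconfigures from local memory instead of occupying the shared configuration path 9.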
In some cases, it may be that a hardware node 2, 3, 4 does not necessarily need a reconfiguration. For example, in some cases, the previous configuration data may be maintained also for the next processing phase (this may be the case of entering the non-reconfiguration transition phase 22b).
It can be understood that the distinction between initiating the configuration download phase 21 and initiating the transition phase 22 may be carried out based on a condition of a signal (e.g. by the presence of the configuration download command on the “set” signal).
The distinction between triggering the transition phases 22a, 22b, 22c, can be made by the internal status of each hardware node (which in turn may be conditioned by the configuration data previously downloaded by the hardware node), or by choice of the host 200. In the internal status of the hardware node there may be comprised the value of some predetermined registers.
For example, a two-program-counter strategy may be performed. For example, the hardware node may be configured so that there are more re-initializations without reconfigurations than reconfigurations. This may be obtained, for example, by using registers and/or counters, e.g. in the strategy discussed below.
In general terms, the configuration data may provide information on the future phases to be subsequently used (e.g. code to be executed, and/or other data to be used, such as runtime configurable parameter(s) to be used, see also below), so that it is pre-determined for each hardware node whether to enter the configuration download phase 21 (and subsequently the downloaded reconfiguration transition phase 22), the previously downloaded reconfiguration transition phase 22a or the non-reconfiguration transition phase 22b next.
After the boot-up 20 at least one hardware node 2, 3, 4 may either enter the configuration download phase 21 (in case of “set”=0) or the downloaded reconfiguration transition phase (in case of “set”=1), the downloaded reconfiguration being the previously downloaded reconfiguration (which may be the full configuration).
In general terms, the three behaviors of each hardware node may differ in whether new configuration data are downloaded and used (downloaded reconfiguration transition phase 22), previously downloaded configuration data are re-used for a new reconfiguration (previously-downloaded reconfiguration transition phase 22a), or no reconfiguration is performed at all (non-reconfiguration transition phase 22b).
Several strategies can be used to further reduce the power consumption. For example, the configuration memory 420 (see below) may be of the static type (e.g., avoiding the necessity of refreshing). The configuration memory 420 may be, for example, implemented in a flash memory or in another non-volatile (non-transitory) memory which does not continuously necessitate the provision of electric power. In general terms, the fewer operations need to be performed to complete the reconfiguration of the hardware (and the quicker the node starts waiting at 352), the greater the energy saving that is achieved. The same may apply, for example, to the core 12 and/or the switching circuitry 12′.
In general terms, it may be that the various hardware nodes 2, 3, 4 of the semiconductor device system 100 do not receive a central clock signal. This permits substantially reducing the power consumption. However, it is possible that at least one of the hardware nodes (e.g., a processing node such as a primary node, or an I/O node, or a communication node) has, internally, its own clock or receives an external clock signal. It is also possible that two or more hardware nodes (e.g. in the vicinity of each other) share a clock signal. In general terms, however, the hardware nodes of the semiconductor device system 100 do not need to share a clock altogether. Notably, the synchronization through the “go” signal 7 is not clocked: between the “go” signal 7 (triggering the operative phase 23) and the immediately subsequent global stop command (terminating the operative phase 23), there may be no clock signal centrally transmitted (e.g. from the central controller 1) to each of the hardware nodes 2, 3, 4, and the operations of the elements 410, 411, 420, etc. in each hardware node 2, 3, 4 are not performed based on a centralized clock.
Moreover, another advantage of the present solution is that all the hardware nodes 2, 3, 4 are transparently and easily reconfigurable, with clearly separated and synchronized reconfiguration phases that do not interfere with the continual asynchronous processing of CTBV signals.
The architecture is also scalable, because it is possible to connect and reconnect different hardware nodes in different fashions according to different designs, and reconfigure them any time in a time-multiplexed fashion.
Moreover, determinism is increased, and the CTBV transmissions are affected by the same latency when propagating along the same point-to-point(s) communication path.
In the above discussion, it is mostly imagined that the set signal can be set high or low at any time. The presence of a device (e.g. host 200) that sets the “set” command high or low (thereby controlling the configuration download command) can also be avoided when the semiconductor device system 100 is marketed, in the cases in which the semiconductor device system 100 does not need any reconfiguration anymore.
In some cases, the set signal and the configuration data may be transmitted from the host 200, which may not be the central controller 1, and which may or may not be part of the final marketed device. Notwithstanding, the semiconductor device system 100 can also be operated without a host 200, e.g. if no reconfiguration is needed. In those cases, it may be that the set signal is always set high (to 1), and the configuration download phase 21 and the downloaded reconfiguration transition phase 22 never occur when the semiconductor device system 100 is operative. It is also possible that the host (transmitting the set signal and/or the configuration data to be downloaded) is either the central controller 1 (or part of the controller 1) or another device internal to the semiconductor device system 100.
In case, e.g., of the previously downloaded reconfiguration transition phase 22a, the configuration data are not downloaded, but previously downloaded configuration data (already stored in the configuration memory 420) may be used for performing a new reconfiguration. In case a non-reconfiguration transition phase 22b is entered, the same configuration used in the immediately preceding reconfiguration is used without even performing a reconfiguration; simply some registers or internal buffers are re-initialized and potentially some internal counters are updated.
The reconfiguration operations (which are based on the configuration data stored in the configuration memory 420) of the communication node 4 may include, for example, instructions on how to switch the switching circuitry (e.g., which switching circuitry input terminal is to be connected to which switching circuitry output terminal(s), for each of the plurality of switching circuitry input and output terminals). On the other hand, in the processing node (e.g., the primary node 2, but the same may apply to the I/O node 3), the configuration memory 420 may cause, when a configuration is performed, a redefinition of the processing core 12. Moreover, the configuration data may include code which is to be executed by a processor implemented in and/or embodying the processing core 12.
The configuration memory 420 may include a plurality of memory segments, e.g. enumerated according to a predetermined sequence. This can permit, for subsequent occurrences of the transition phase (e.g. in the case of the previously downloaded reconfiguration transition phase 22a), to make use of previously-downloaded configuration data currently stored in subsequent memory segments of the configuration memory 420. In fact, it is possible to store configuration data regarding different reconfigurations in different (e.g., subsequent) memory segments of the configuration memory 420. Accordingly, each time a new transition phase (e.g., a previously-downloaded reconfiguration transition phase 22a) is entered, a new memory segment (e.g., the next memory segment in the predetermined sequence) is used for performing the reconfiguration. Therefore, during a single configuration download phase 21 (or, more in general, during one single download), multiple memory segments of the same configuration memory 420 of the same hardware node (2 and/or 4) may be written with different configuration data. Accordingly, the first memory segment, as enumerated according to the predetermined sequence, is written with the configuration data pertaining to the first reconfiguration to be performed in the downloaded reconfiguration transition phase 22 immediately following the download, while the subsequent memory segments are written, according to the predetermined sequence, with the configuration data pertaining to the subsequent reconfigurations to be performed in the subsequent occurrences of the subsequent transition phases (e.g. in the previously downloaded reconfiguration transition phases 22a).
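The segment rotation described above may be illustrated by the following minimal software sketch (class and method names are hypothetical; the actual configuration memory 420 is a hardware block):

```python
# Sketch of a segmented configuration memory: one download fills several
# segments; each transition phase (e.g. 22a) then consumes the next segment
# in the predetermined sequence.
class SegmentedConfigMemory:
    def __init__(self, num_segments):
        self.segments = [None] * num_segments
        self.next_segment = 0               # position in the predetermined sequence

    def download(self, configs):
        # One configuration download phase 21 may write multiple segments at once.
        for i, cfg in enumerate(configs):
            self.segments[i] = cfg
        self.next_segment = 0

    def reconfigure(self):
        # Each previously-downloaded reconfiguration transition phase 22a uses
        # the next segment of the sequence.
        cfg = self.segments[self.next_segment]
        self.next_segment += 1
        return cfg
```

For instance, one download of two configurations allows two subsequent reconfigurations without any further download.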
With reference to
It is noted that a CTBV transmission over the sequence of hard-wired connections is latency-deterministic, such that information can be encoded in the timing of each CTBV signal. Each CTBV signal therefore propagates (during an operative phase 23) without its timing being impaired by the propagation of any other CTBV signal propagating simultaneously (in the same operating phase 23).
In the present examples, it is not of importance whether the sender (transmitting hardware node) and the receiver (receiving hardware node) share the same timing reference (clock) for encoding and decoding the information from the CTBV signal. What is eminently important is that the transmission is latency-deterministic from the sender to the receiver: each transmission of a CTBV signal does not share any hardware (wires, switching circuitry) with other CTBV signal transmissions, and is therefore traffic-independent.
According to aspects, the hardware nodes (and in particular transmitter and the receiver) do not share any clock.
According to aspects, the connections defined by the switches of circuitries 12 and 12′ are static for the whole duration of the operative phase 23, and they are only changed during the non-operative phase (e.g. the transition phase).
As explained above, it may be that the hardware nodes configured as processing nodes are natively (e.g. structurally) implemented as processing nodes (e.g. they cannot be used as communication nodes). Or, it may be that the hardware nodes configured as I/O nodes are natively (e.g. structurally) implemented as I/O nodes (e.g. they cannot be used as primary nodes and/or as communication nodes). Or, it may be that the hardware nodes configured as primary nodes are natively (e.g. structurally) implemented as primary nodes (e.g. they cannot be used as I/O nodes and/or as communication nodes). Additionally or alternatively, it may be that the hardware nodes configured as communication nodes are natively (e.g. structurally) implemented as communication nodes (e.g. they cannot be used as processing nodes).
In examples in which the hardware nodes configured as processing nodes are natively (e.g. structurally) implemented differently from the hardware nodes configured as communication nodes, some implementations may achieve advantages with respect to energy and area efficiency. Further, speed may be increased. In particular, it may be that the hardware nodes which are natively implemented as processing nodes have circuitry for encoding/decoding (and/or transmitting/receiving) the CTBV signals (e.g. circuitry for performing the spiking decision), but lack circuitry for performing the switching. On the other hand, the hardware nodes which are natively implemented as communication nodes may have circuitry for performing the switching, but lack the circuitry for encoding/decoding (and/or transmitting/receiving) the CTBV signals. Notably, the switching by the communication nodes is not performed during the operating phase, and therefore the circuitry of the communication nodes is mainly off during the operating phases. Conversely, the transmission of CTBV signals by the processing nodes may only be performed during the operating phase, and therefore the corresponding circuitry of the processing nodes may be mainly off outside the operating phases. This leads to an increased area efficiency.
In these examples, it may be that the communication nodes are spatially grouped in proximity with each other, while the processing nodes are spatially grouped in proximity with each other, but spaced apart from the communication nodes. Therefore, there may be at least two distinct portions of the semiconductor device system: one first portion, hosting the communication nodes spatially grouped in proximity with each other (but not hosting any processing node), and a second portion, distinct and spaced apart from the first portion, hosting the processing nodes spatially grouped in proximity with each other (but not hosting any communication node). Each of the two portions may be supplied by a dedicated supply line.
However, the inventors have understood that it is even more advantageous if one single communication node (e.g. a natively-implemented communication node) is placed in the vicinity of one single processing node (e.g. a natively-implemented processing node), see
Therefore, a first succession with a plurality of communication nodes and a second succession with a plurality of processing nodes may be placed in an interleaved arrangement, such that each processing node of the second succession is placed in direct proximity of at least one communication node of the first succession.
Spiking Neural Networks
The semiconductor device system 100 may implement a spiking neural network, SNN. The SNN may comprise a plurality of neurons. Each neuron may have runtime configurable parameters (e.g. offset, gain, activation function, etc.). Each neuron may output at least one CTBV signal processed as determined by the runtime configurable parameters. The SNN may include a plurality of synapses between the neurons, each of which may have runtime configurable parameters (e.g. synaptic weight, transmission probability, delay, etc.). Each synapse may provide an input signal to a neuron. The input signal may be an output signal of the same neuron, or from another neuron. At least one neuron of the plurality of neurons may be implemented in one unique processing node (e.g. one single primary node 2) among the plurality of processing nodes. At least one synapse may be implemented, in part, in a respective point-to-point(s) communication path of the plurality of point-to-point(s) communication paths in the CTBV interconnection circuitry 8, and in part inside the processing node of the at least one source and/or target neuron connected to the synapse. CTBV signals may be transmitted by each processing node (e.g. primary node) 2 to encode the values outputted by the neurons. The runtime configurable parameters (e.g. neuron parameters and synapse parameters) may be provided to the respective processing node (e.g. primary node 2), for example, by download (e.g. during the configuration download phases 21). In the transition phase (e.g., 22, e.g. 22c), the downloaded runtime configurable parameters may be uploaded in the core 12, so that they are used in the subsequent operative (processing) phase 23. It is possible, for example, to download, in one single download, a plurality of configuration data with different runtime configurable parameters, to be used in different operative phases 23. 
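Purely for illustration (the actual neuron cores 12 may be analog or mixed-signal circuits), the effect of the runtime configurable parameters on a neuron's output level may be sketched as follows; the function name, the thresholding rule and all parameter defaults are assumptions, not part of the claimed implementation:

```python
# Toy model of a neuron with runtime-configurable offset and gain; the boolean
# result stands in for the high/low level of a CTBV output signal.
def neuron_output(inputs, weights, offset=0.0, gain=1.0, threshold=1.0):
    # Weighted sum of the synaptic inputs, shaped by the runtime parameters.
    activation = gain * (sum(w * x for w, x in zip(weights, inputs)) + offset)
    return activation >= threshold
```

Swapping the runtime parameters between processing phases 23 (e.g. by uploading a different memory segment during a transition phase) then changes the neuron's behavior without changing its hardware.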
In the SNN, the inputs may be provided to I/O nodes 3 used as input, and the system 100's outputs may be provided to I/O nodes 3 used as output. A plurality of hidden layers or recurrent connections may be provided by a plurality of primary nodes 2. The plurality of primary nodes 2 may be adequately interconnected according to point-to-point(s) communication paths e.g. conditioned by the switching controlled by the communication node(s) 4, thereby embodying the synapses.
The host 200 may control the functioning of the SNN. In particular, the host 200 may assign, to each individual neuron (or plurality of neurons), one hardware node and, to each synapse, one point-to-point(s) communication path, and/or provide, to each hardware node, runtime configurable parameters as part of the configuration data.
The host 200 may control at least one training session on the semiconductor device system 100, during which different output signals are examined for different input signals and different runtime configurable parameters. The host 200 may evaluate the input and output signals and the runtime configurable parameters according to a given cost function, and optimize the runtime configurable parameters so as to minimize the cost function. In some examples, once the training session has been performed, the host 200 may be disengaged (disconnected), and the configuration sessions 21 (as well as the “set” command 6) may not be used anymore (while the semiconductor device system 100 may continue to operate). In other examples, the host 200 may remain part of the semiconductor device system 100 (or a device permanently engaged to the semiconductor device system 100) and continue operating.
It is also noted that the latencies, which are in general affected by the different conductivities of the different point-to-point(s) communication paths (e.g., a long point-to-point(s) communication path has in general a higher resistance and capacitance than a short one), can be compensated for in applications such as an SNN. If, for example, a synapse has a large resistance, the calculation of its runtime configuration parameter(s) (e.g. weights) can also take into account the larger resistance, and the training (learning) process can therefore be performed under consideration of that particular large resistance. After the training phase, the same synapse will have the same point-to-point(s) communication path and the same resistance, and therefore the runtime configuration parameter(s) will deterministically suit the particular synapse.
By virtue of the latency determinism, it is possible to guarantee that two different SNNs with the same hardware, the same configurations and the same runtime configuration parameters will behave identically. Therefore, the results of a training can also be transferred to another identical device.
Even more advantageously, it is possible to train an SNN on another device (e.g. a CPU or GPU) with knowledge of the delays which would occur when applying it on the present device, i.e. a hardware-aware training. The runtime configuration parameters obtained through the hardware-aware training will then be provided to the hardware nodes during initial configuration phases (and the host could also be disconnected after having provided the runtime configuration parameters).
Further Discussion
The Present Solution
We invented a system architecture and control flow for processing data, primarily time-varying signals, using distributed nodes with user-defined functionality that communicate via a configurable asynchronous digital communication network. The distributed hardware nodes 2, 3, 4 can run concurrently without need for a global clock, and they communicate through globally asynchronous continuous-time binary-value (CTBV) signals. These CTBV signals may be routed through a CTBV interconnection circuitry (e.g. a configurable grid-like network of multi-wire buses with constant and predictable delays for a given configuration). The operation of the semiconductor device system 100 may be broken up into distinct processing phases, which may allow time-multiplexed operation. The characteristics of the semiconductor device system 100 allow encoding of information directly in the timing of CTBV signals within processing phases. The proposed control flow enables self-controlled re-configuration of the semiconductor device system 100 based on local configuration memories (e.g. 420) of hardware nodes 2, 3, 4 as well as receiving external updates (e.g. new configuration data) for the local configuration memories 420 between two different processing phases.
Here, there are mostly discussed a) the set of interfaces of the hardware nodes, and/or b) the proposed control hierarchy and/or c) the processing scheme, which preferably together support the time-multiplexed processing of precisely timed CTBV signals. We also provide d) specific implementation examples of system components.
Specifically, we propose a system comprising a central controller (1), I/O nodes (3), primary nodes (2) and communication nodes (4) that operate in parallel without need for a global clock. Each processing node (which may be a primary node or an I/O node) and each communication node is connected to a common configuration interface (9), a common control interface (embodying the group of at least one among the “ready” signal 5, the “set” signal 6, and the “go” signal 7) and a common data interface (CTBV interconnection circuitry 8), each described in detail below, as well as to a core (12) that determines the behavior of the hardware node. The core 12 may be defined by the configuration data downloaded by the hardware node. The core 12 can be digital, analog or mixed-signal.
Once configured, each hardware node 2, 3, 4 may run continuously through a sequence of (e.g. three) phases. Each entry into one of these phases acts as a synchronization barrier and may be triggered through special flags in the control interfaces of the system's internal nodes and/or of the host 200. During the configuration download phase 21, (configuration) data can be received (e.g. downloaded) from the host 200. During the transition phase 22 (or any of its instantiations 22a, 22b, 22c), each hardware node can perform internal operations (reconfiguration or re-initialization) to prepare for the next processing phase 23. During processing phases 23, all hardware nodes operate asynchronously and communicate via the data interface (CTBV interconnection circuitry 8).
Architecture Components
Each primary node 2 and at least one I/O node 3 may receive incoming CTBV signals from a multi-wire input port (arriving along one point-to-point(s) communication path). Each primary node 2 and at least one I/O node 3 may produce outgoing CTBV signals on a multi-wire output port (connected to one point-to-point(s) communication path). The in- and output ports of each processing node 2, 3 may be, in some implementations, connected to corresponding ports on at least one communication node 4 (this appears, for example, in
At least one communication node 4 may be additionally connected to several other communication nodes via the same type of interface (through the CTBV interconnection circuitry), which may include a multi-wire bus, and/or communicates with these connected communication nodes 4 via the same type of CTBV signal (this appears, for example, in
An implementation example of a communication node 4 with constant and predictable transmission delays is a communication node 4 including or controlling a switching block, which can use any combination of static connections and dynamically programmable switches to connect its CTBV inputs to its CTBV outputs. The dynamic switches can be set (e.g. during a transition phase 22, or 22a, 22b, or 22c) in dependence on the communication node's configuration and thus provide flexibility for programming different connection patterns and time-multiplexing between them. Static connections provide an area efficient means to forward CTBV signals directly through a communication node. By combining static and dynamically switched connections, such a switching block (e.g. 12′) can be designed to support any degree of flexibility, starting from a static one-to-one connection pattern between one input and one output each, up to a fully programmable implementation that allows arbitrary connection patterns between inputs and outputs. Additionally, the input and/or output CTBV signals of each communication node could be regenerated to the full signal amplitude, in order to counteract any attenuation effects. A connected set of such communication nodes can be interpreted as a switching fabric, which is a special case of the data interconnect system (CTBV interconnection circuitry 8).
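The combination of static and dynamically programmable connections may be sketched, in a merely illustrative software form (class and method names are hypothetical; the real switching block 12′ is hardware), as:

```python
# Sketch of a switching block mixing hard-wired (static) routes with
# programmable switches that are only set during a transition phase.
class SwitchingBlock:
    def __init__(self, static_routes):
        self.static = dict(static_routes)   # input wire -> output wire, fixed
        self.dynamic = {}                   # programmable part of the fabric

    def program(self, routes):
        # Allowed only outside the operative phase (e.g. during phase 22).
        self.dynamic = dict(routes)

    def route(self, wire_in):
        # Static connections need no configuration; dynamic ones are looked up.
        if wire_in in self.static:
            return self.static[wire_in]
        return self.dynamic.get(wire_in)
```

Because both routing tables are frozen during the operative phase, the transmission delay through such a block remains constant for a given configuration.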
User-defined cores 12 of each I/O node 3 can support, in principle, arbitrary types of inputs and outputs (10), provided that they convert the in- and outputs to and from the CTBV format required by the data interface (CTBV interconnection circuitry 8). An I/O node 3 can be connectable to dedicated pins of a physical chip, an internal digital bus, an inter-chip interconnect, or any other source/sink of inputs/outputs. Each I/O node 3 may be associated with at least one communication node.
Each processing or communication node 2, 3, 4 may be additionally connected through its configuration interface to a configuration bus system (or, more in general, configuration communication circuitry) that is independent of the data interconnect system (CTBV interconnection circuitry 8). No constraints are imposed on how the cores 12 and 12′ of the processing and communication nodes 2, 3, 4 use the configuration bus system (or, more in general, the configuration communication circuitry) 9, but one intended usage is outlined in the section “Program execution”.
The distinction between I/O, primary and communication nodes can be merely conceptual and may simplify the description of the semiconductor device system 100; formally, all communication nodes 4 and I/O nodes 3 could be interpreted as special implementations of the hardware nodes, in which case the entire system 100 can also be described entirely in terms of hardware nodes.
Program Execution
A program is executed on the proposed architecture by going through several distinct phases. Entering new phases may be triggered by the globally broadcasted set (6) and go (7) flags. Each node can signal its own readiness for phase transitions to the central controller through separate ready (5) flags. If operated by the external host 200, the host 200 controls the set flag (6) as well as an additional ready flag (5), which allows the host to delay phase transitions as long as desired by keeping the flag low or high, respectively. If operated without the external host 200, the set flag (6) is not needed and can, for the rest of this text, be assumed to be constantly high (1).
In some examples, the central controller 1 can operate in two modes: an eager execution mode and a step mode, both described below.
Boot-Up and Reset
After boot-up, the semiconductor device system 100 is reset. After reset of the semiconductor device system 100, each hardware node 2, 3, 4 initializes the ready flag in its control interface to low (0), and the central controller 1 initializes the globally broadcasted go flag to low, as well (boot-up phase 20). This sets up the semiconductor device system 100 for a first configuration download phase 21.
Configuration Download Phase
When operated with the host 200, the host 200 controls the broadcasted set flag. When the set flag is low while the go flag goes low (or if both set and go flag are initialized to zero on boot-up), a configuration download phase (21) may be started.
During a configuration download phase 21, hardware nodes 2, 3, 4 can receive (by download) new (updated) configuration data from an external source (e.g. the host 200), via their configuration interfaces, through the configuration communication circuitry 9. The configuration download phase 21 can thus be used for overwriting node-internal memories (e.g. configurations of processing and communication nodes and input data for I/O nodes 3) without the risk of collisions with read accesses that can occur during the other phases.
Once the host 200 sets the set flag to high, the configuration download phase 21 ends and a transition phase begins. If the set flag is already high when the go flag goes low (e.g. if the set flag is held constant high), the configuration download phase 21 may be skipped entirely in some examples, and the next transition phase 22 begins immediately.
Transition Phase
In the transition phase (22) each hardware node 2, 3, 4 can, but does not have to, perform internal (re-)configuration as determined by the user-defined core of the hardware node. Once a hardware node 2, 3, 4 has completed its reconfiguration, the hardware node 2, 3, 4 may set the ready flag (5) in its control interface to high.
In eager execution mode, the central controller 1 may monitor the hardware nodes' (and, if available, the host's) ready flags, and once all of these flags have been set, the central controller immediately sets the go flag to high, which ends the current transition phase and signals to all nodes that the next processing phase has begun.
In step-mode, the go flag is instead controlled by the host via an external flag go_ext (11). This means that the transition from transition to processing phase (and vice versa) is explicitly triggered from the outside by setting or clearing that flag. The step mode may be useful, for example, for performing debugging, and/or if an external input signal is provided and needs to be processed at fixed intervals, in particular if the user-defined cores 12 and 12′ run continuously and do not generate any ready signals themselves.
Processing Phase
During a processing phase (23), each processing node 2, 3 asynchronously processes its CTBV input signals and/or produces CTBV output signals as determined by the core 12. Once a hardware node is ready to end the current processing phase 23 (this is also determined by the core 12 of the hardware node), it clears the ready flag again.
E.g. when in eager execution mode, the central controller signals the end of the current processing phase by clearing the global go flag again once all nodes (and the host, if any) have cleared their respective ready flags.
E.g. when in step mode, the processing phase continues until the host clears the go flag.
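The way the central controller 1 may drive the global go flag in the two modes can be summarized by the following purely illustrative software model (function and parameter names are hypothetical; the actual controller may be asynchronous hardware):

```python
# Illustrative model of the central controller's handling of the go flag.
def update_go(go, ready_flags, mode, go_ext=False):
    if mode == "step":
        # Step mode: go simply mirrors the external go_ext flag (11) from the host.
        return go_ext
    # Eager execution mode:
    if not go and all(ready_flags):
        return True   # all nodes (and the host) ready: start the processing phase
    if go and not any(ready_flags):
        return False  # all ready flags cleared: end the processing phase
    return go         # otherwise keep the current value
```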
Clearing the go flag may trigger the beginning of the next configuration download phase 21 (e.g. if the set flag is low), or it causes the configuration download phase to be skipped, and the hardware node may proceed to the next transition phase 22 (or 22a, 22b, 22c), e.g. if the set flag is high.
This cycle of alternating (optional) configuration, transition and processing phases may repeat indefinitely.
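From the point of view of a single hardware node, the cycle of phases may be modeled as the following sketch (phase labels and the function name are hypothetical; they merely mirror the set/go logic described above):

```python
# Sketch of the phase cycle of one hardware node, driven by the broadcast
# go and set flags.
def next_phase(current, go, set_flag):
    if current == "processing" and not go:
        # go cleared: enter the configuration download phase 21 if set is low,
        # otherwise skip it and go directly to the next transition phase 22.
        return "download" if not set_flag else "transition"
    if current == "download" and set_flag:
        return "transition"   # host raised set: the download phase ends
    if current == "transition" and go:
        return "processing"   # controller raised go: the processing phase begins
    return current            # otherwise remain in the current phase
```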
The Control Interface
Each hardware node 2, 3, 4 may contain the same control communication interface, comprising at least one of the global go flag, which can only be read by the hardware node, and the local ready flag, which can be set and cleared by the hardware node. As described previously the communication interface can be extended by a set flag, which can only be read.
Implementation Example: Static Node
An example implementation of a non-configurable, non-resettable processing node would set its ready flag as the inverted go flag and ignore the set flag. Excluding individual non-configurable and non-resettable nodes completely from this handshake would also be possible as long as they are able to continuously process CTBV signals.
Implementation Example: Reconfigurable Node with Phase Counters
Another implementation example of a reconfigurable hardware node 2, 3, 4, which is also able to perform an internal initialization between processing phases, is shown in
At the beginning of each transition phase, the counter value of phase_count may be checked. If it is zero (301), then the next_config flag may be toggled (320). This causes the counter value of config_count to be incremented, e.g. by one (340), when it is not equal to loop_end (330), or reset to loop_start (341). The change of config_count then may toggle the reconfig flag (350), which triggers a reconfiguration process (351) of the processing core 12 or 12′ from the configuration memory 420; the implementation of this process is not limited. Considered options are the usage of a memory-access handshake, or a handshake-less memory where the data stored at config_pointer is directly available to the core. A reconfiguration can include setting phase_count to a new value. After the reconfiguration, the processing core sets ready to high, and remains waiting at 352.
If phase_count is not zero (300), then it may be decremented by one (310) and the clear flag may be toggled (360). The clear flag may trigger the processing node to perform an optional reset (361) of internal states and set ready to high after that.
This invention does not limit whether the ready flag is used as a handshake answer to reset the reconfig or clear flag, or the user-defined processing core is double-edge sensitive. It also does not limit whether this control interface is implemented in synchronous logic with a global or local clock or asynchronous logic.
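The counter logic of this implementation example may be summarized by the following purely illustrative software model (the counter names phase_count, config_count, loop_start and loop_end are taken from the description above; the surrounding class structure is hypothetical):

```python
# Software model of the phase-counter strategy: phase_count selects between
# re-initialization and reconfiguration; config_count cycles through the
# configurations between loop_start and loop_end.
class PhaseCounterNode:
    def __init__(self, loop_start, loop_end, phase_counts):
        self.loop_start, self.loop_end = loop_start, loop_end
        self.config_count = loop_end        # so the first transition wraps to loop_start
        self.phase_count = 0
        self.phase_counts = phase_counts    # phase_count value installed by each config

    def transition(self):
        if self.phase_count == 0:                    # check (301)
            if self.config_count != self.loop_end:   # compare with loop_end (330)
                self.config_count += 1               # increment (340)
            else:
                self.config_count = self.loop_start  # reset to loop_start (341)
            # reconfigure from the selected configuration (350, 351); the
            # reconfiguration may set phase_count to a new value:
            self.phase_count = self.phase_counts[self.config_count]
            return "reconfigured"
        self.phase_count -= 1                        # decrement (310)
        return "re-initialized"                      # clear/reset internal state (360, 361)
```

For instance, with phase_counts={0: 1, 1: 0}, configuration 0 is followed by one plain re-initialization, while configuration 1 is immediately followed by the next reconfiguration, yielding more re-initializations than downloads.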
The Configuration Interface
Each hardware node 2, 3, 4 may have a configuration interface that connects it to the configuration bus system (configuration communication circuitry 9).
The Data Interface
Each processing node 2, 3 exchanges data with one or more associated communication nodes 4 and/or other processing nodes 2, 3 e.g. through a multi-wire bus, or more in general with a CTBV interconnection circuitry 8 (e.g. segments 2.4). Communication nodes 4 may be connected e.g. to each other via the same type of connections (e.g. segments 4.4 in
Input and Output
The possible input/output interfaces used in I/O nodes 3 are not limited by this invention. It is also not limited when I/O nodes 3 receive or send external input or output data, and how they convert the in- and outputs to and from the CTBV format required by the data interface (CTBV interconnection circuitry 8). I/O nodes 3 simply send and/or receive CTBV signals via their data interface (CTBV interconnection circuitry 8), e.g. during processing phases 23, and may implement a ready-set-go or, if no set flag is used, the reduced ready-go phase transition logic.
One possible use-case is to use I/O nodes 3 to wait in a configuration phase 21 or transition phase 22 (or any of its possible instantiations, such as 22a, 22b, 22c) and thereby delay the start of a new processing phase until new input data is available. In combination with an I/O node configuration that defines whether an I/O node requires new input data for the next processing phase, this can be used for an input-driven program execution. Adding FIFO-like input or output buffers would also allow for asynchronously receiving input and sending output data, while processing already available data.
Another specific use of I/O nodes 3 may be to support caching intermediate results of time-multiplexed processing phases, i.e. to (approximately) record CTBV signals during one phase and to output them during a later phase. To support this behaviour, an I/O node 3 needs to be able to store the time-course of a CTBV signal with an application-dependent precision. A synchronous digital implementation example of this is a shift register, which approximately records a CTBV signal by shifting in a 1-bit during each clock cycle where the observed CTBV signal was high, and a 0-bit otherwise. The resulting bit sequence can be stored and later used to reproduce a CTBV signal by iteratively shifting out the stored bits, outputting a high signal for the duration of one clock cycle for each 1-bit, and a low signal for each 0-bit.
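The shift-register record-and-replay scheme above can be sketched as follows. This is a purely illustrative software model (function names are assumptions): the CTBV signal is represented as a list of per-clock-cycle samples, standing in for the hardware shift register.

```python
def record_ctbv(samples):
    """Approximate a CTBV signal by sampling once per clock cycle:
    shift in a 1-bit for each cycle where the signal was high,
    and a 0-bit otherwise."""
    return [1 if high else 0 for high in samples]

def replay_ctbv(bits):
    """Reproduce the recorded signal by shifting the stored bits out
    again: one clock cycle high per 1-bit, one cycle low per 0-bit."""
    return [bit == 1 for bit in bits]
```

The recording is exact only up to the clock resolution: signal edges falling between sampling instants are quantized to the nearest clock cycle, which is the application-dependent precision mentioned above.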
Hierarchical Application
We foresee two different ways in which our proposed system architecture and control flow can be incorporated into a larger hierarchical system architecture.
First, it is possible to instantiate any type of user-defined processing core within a processing node, as long as it is compatible with the previously described data, control and configuration interfaces. That means that arbitrarily complex subsystems including multi-core architectures can be nested inside processing nodes. The internal communication inside such subsystems is not limited to CTBV communication. For example, in the context of processing SNNs a user-defined processing core might contain a neuromorphic circuit implementation of a single neuron or a group of neurons with associated synaptic connections, or a circuit that otherwise emulates the operation of a neuron or group of neurons.
A second extension of the architecture can be realized by connecting multiple subsystems (52), each of which follows the system architecture and control flow described above, via an arbitrary I/O interface (510) connected to the subsystems' I/O nodes. In this case, the communication between the subsystems does not have to obey the CTBV format, and can instead be digital (synchronous or asynchronous), analog, optical, wireless radio or other. A global central controller (51) is used to apply the ready-set-go phase transition sequence globally to all subsystems by reading all ready flags (55) and overwriting the subsystems' go_ext flags with a global go_glo flag (57). The set flag (6) is broadcast from the host to all subsystems. The host can also use a shared configuration bus (9) connected to the configuration interfaces of the subsystems.
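For illustration, the global controller's part of the ready-set-go sequence can be sketched as follows. The dictionary-based subsystem state and the function name are illustrative assumptions; reference numerals in the comments follow the description above.

```python
def global_phase_transition(subsystems):
    """Sketch of the globally applied ready-set-go sequence: wait until
    every subsystem has raised its ready flag (55), then broadcast the
    set flag (6) and drive all go_ext flags via the global go_glo (57).
    Returns True once the transition has been issued."""
    if not all(s["ready"] for s in subsystems):
        return False                 # at least one subsystem not ready yet
    for s in subsystems:
        s["set"] = True              # set flag broadcast to all subsystems
        s["go_ext"] = True           # go_ext overwritten by global go_glo
        s["ready"] = False           # ready is consumed by this transition
    return True
```

In a hardware realization the same logic would typically be a wired-AND (or tree reduction) of the ready lines gating a broadcast go signal, rather than a sequential loop.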
Simulation
A first simulation setup to verify basic functionality of the proposed architecture and control flow was implemented with SystemC. The implemented architecture consists of one host, a central controller, two I/O nodes, two processing nodes and four communication nodes. Each hardware node has a configuration memory containing two different configurations. While I/O nodes 3 switch their firing rate and stimulus duration per processing phase between configurations, processing nodes 2 switch their "neuron" thresholds and communication nodes 4 switch their connectivity as shown in
The setup of the simulation is shown in
Conventional Technology of Implementation Example: Reconfigurable Node with Phase Counters
The usage of local program memory with program counters (phase_count and config_count basically form a two-stage program counter) has some advantages, e.g. in combination with the ready-set-go handshake discussed above. For the application of spiking neural networks, logical neurons are only reprogrammed onto physical ones between time-multiplexing steps, instead of being loaded within processing phases. The latter is typical for traditional neuromorphic hardware, like Intel's Loihi chip, where spikes are assigned addresses and the parameters of the correct logical neuron are loaded from local memory in dependence on the address whenever a new spike is processed. There is currently no architecture in which local program counters and program memory are synchronized via a global handshake to advance during (e.g.) transition phases while being inactive during processing.
Application for Spiking Neural Networks
When applying the hardware architecture to the application of Spiking Neural Networks the described abstract nodes can be specified as follows.
A processing node may represent one or multiple spiking neurons, which process incoming CTBV signals containing binary spikes, whereby the length of a spike as well as its timing might contain information. When turned "on" (being in a processing phase), neurons receive incoming spikes, process them and send out generated spikes globally asynchronously, without waiting for a global synchronization signal. That typically requires a mixed-signal implementation to avoid arbitration problems (or short circuits), which would distort the timing of incoming spikes when simultaneously processing multiple inputs. A neuron's configuration can contain parameters regarding the neuron itself (threshold, leakage, offset, etc.), but it may also contain the weights of the neuron, if the synapses are implemented within the neuron itself and each input synapse of a neuron is associated with one CTBV input terminal of the neuron. A reason for synapses to be implemented within neurons, and not for example within the switch boxes, may be that a weighting of the input CTBV signals within synapses either requires the transmission of a spike pulse with variable amplitude (which is no longer a CTBV signal) or variable length (which can result in collisions/overlapping if implemented in switch boxes). Therefore a solution may be to implement synapses in conjunction with the target neuron inside a single processing node, where the result of weighting a CTBV spike no longer has to comply with CTBV signal constraints, but can be a continuous-valued voltage or current, etc.
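As a purely illustrative sketch (not the disclosed mixed-signal circuit), the in-neuron weighting described above can be modeled with a simple leaky integrate-and-fire update, where weighted CTBV inputs accumulate into a continuous-valued membrane variable that no longer obeys CTBV constraints. All parameter names here are assumptions for illustration.

```python
def lif_step(v, inputs, weights, threshold, leak):
    """One discrete-time update of a simple leaky integrate-and-fire
    neuron. `inputs` are per-synapse CTBV samples (True = spike present
    at that CTBV input terminal); `weights` are the per-synapse weights
    stored in the neuron's configuration. Returns (new membrane value,
    output spike flag)."""
    # weighting happens inside the neuron: the sum is a continuous value
    v = v * (1.0 - leak) + sum(w for w, hi in zip(weights, inputs) if hi)
    if v >= threshold:
        return 0.0, True      # reset membrane and emit an output spike
    return v, False
```

An actual implementation according to the description would be analog or mixed-signal and continuous in time; the discrete update above merely shows why the weighted quantity can be a continuous voltage or current rather than a CTBV signal.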
The communication nodes 4 may be implemented with configurable switch boxes (e.g. 12′), which together form a switched fabric network. A circuit-switched communication is used instead of the package-switched communication that is dominant in conventional neuromorphic hardware, because circuit-switched communication has (for one configuration) only constant delays and no traffic-dependent delays. The number of inputs and outputs of communication nodes (switch boxes) in relation to the number of inputs and outputs of processing nodes (neurons) defines the number of possible connections (synapses) that can be realized by the system when implementing a Spiking Neural Network. Therefore, the optimal number of inputs and outputs for the communication nodes, as well as for the processing nodes, depends on properties of the class of networks to be executed (e.g. the fan-in and fan-out of neurons, the size of the network and its topology).
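The defining property of circuit-switched routing noted above can be illustrated as follows: for one configuration, each switch-box output is statically wired to one input, so routing is a fixed mapping with constant, traffic-independent delay. The list-based configuration format and function name are illustrative assumptions.

```python
def route(switch_config, input_signals):
    """Illustrative circuit-switched switch box: switch_config[i] names
    the input statically connected to output i under the current
    configuration. No arbitration and no queues are involved, so the
    per-signal delay is constant and independent of other traffic."""
    return [input_signals[src] for src in switch_config]
```

In contrast, a packet-switched network would arbitrate between colliding packets, making the delay depend on the data and on other transmitted signals, which the CTBV format cannot tolerate.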
I/O nodes 3 can be implemented, for example, in a sample-based input fashion or an always-on fashion. A sample-based input fashion could mean, for example, that they either receive arbitrary input data representing one discrete data sample, which they subsequently convert into spike trains, or receive spike trains (or digital approximations thereof) that represent one discrete input sample. The neurons and switch boxes could, for example, either stay "always-on", meaning they are not reset between samples, or they can be synchronous, so that a global synchronization signal (e.g. ready-set-go) can be used to reset nodes between samples.
In addition, the ready-set-go handshake can, for example, also be used to execute networks on the chip in a time-multiplexed fashion that are larger than what is supported in parallel by the number of physical neurons and connections, when neurons and switch boxes have local memory from which they can reconfigure themselves. This time-multiplexed mode may require the I/O nodes 3 to be able to buffer spike trains of one time-multiplexing step and feed them to the next step.
Another use-case may be, for example, the always-on operation of the spiking neural network, in which the complete network can be mapped in parallel onto the systems processing and communication nodes, the I/O nodes receive spike trains as input and produce spike trains as output, and no time-multiplexing or reconfiguration is needed during operation. The neurons and switchboxes then only need one initial configuration and can continuously process spike events.
A big challenge when applying an always-on spiking neural network to a real-time application is that the temporal dynamics determined by parameters of the neural network model, and therefore also the temporal dynamics of its hardware implementation, must meet the timing requirements of the application. This is a so far unsolved problem, since most current approaches implement spiking neural networks in discrete time or in globally synchronized processing (pipeline) steps (e.g. Intel Loihi, IBM TrueNorth), which have no strict relation to physical time. Our proposed system 100 instead allows the user-defined cores 12 to incorporate analog circuits, or digital circuits with a local clock, that can realize arbitrary temporal dynamics as required by the use-case.
Mode of Operation in (Neuromorphic) SNN Hardware Accelerators
An important feature of our architecture is its specific mode of operation, e.g. for executing SNNs. Neuromorphic SNN accelerators typically operate in two phases: a configuration phase (e.g. after reset), during which parameters are loaded into memory, and an inference phase, during which data is processed with the given set of parameters. A time-multiplexed operation therefore requires alternating between these two phases, i.e. repeatedly loading the currently valid configuration from a central memory or a buffer. We are not aware of a neuromorphic SNN accelerator that makes use of a phase-transition signaling and handshaking mechanism similar to the one proposed in our solution.
Characteristics
Some Aspects
Effects and Benefits
Technical Application Domain
It is noted that the present examples shall not be confused with FPGAs. FPGAs do not provide any special circuitry to support the above-mentioned runtime reconfigurability of the system, separation of CTBV signals and configuration data, and the self-timed operation as explained later.
Further Characterization of the Figures:
Some Final Summarizing Remarks
The semiconductor device system 100 may be instantiated, for example, by a chip (with internal or external host 200). The hardware nodes 2, 3, 4, the circuitry 8 (including the hard-wired connections), and the central controller 1 may therefore be implemented in ASIC in the same chip.
In general terms, the hard-wired connection, the switching circuitries, and the hardware nodes exchange electric communications (and the CTBV signals are electric signals, as well).
Here below is a quick review of the several wirings that may be used:
The implementation in hardware or in software may be performed using a digital storage medium, for example cloud storage, a floppy disk, a DVD, a Blu-Ray disc, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some examples according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, examples of the present invention may be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine-readable carrier.
Other examples comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier. In other words, an example of the method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further example of the methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. A further example is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet. A further example comprises a processing means, for example a computer or a programmable logic device, configured to or adapted to perform one of the methods described herein. A further example comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some examples, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some examples, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents, which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
22189135.1 | Aug 2022 | EP | regional |