The present disclosure relates generally to electronic circuits, and relates more specifically to, e.g., parallel computer design, parallel programming models and systems, interconnection-network design, field programmable gate array (FPGA) design, computer architecture, and electronic design automation tools.
The present disclosure pertains to the design and implementation of massively parallel computing systems. In an embodiment the system is implemented in a system on a chip. In an embodiment the system is implemented in an FPGA. The system employs a network-on-chip (“NOC”) interconnection to compose a plurality of processor cores, accelerator cores, memory systems, diverse external devices and interfaces, and hierarchical clusters of processor cores, accelerator cores, memory systems, and diverse external devices and systems together.
To date, prior art work on FPGA system-on-a-chip (SOC) computing systems that comprise a plurality of processor cores has produced relatively large, complex, and slow parallel computers. Prior art systems employ large soft processor cores, large interconnect structures, and unscalable interconnect networks such as buses and rings.
In contrast, an embodiment of the present work achieves, comparatively, orders of magnitude greater computing throughput and data bandwidth, at lower energy per operation, implemented in a given FPGA. It does so by employing a particularly efficient, scalable, high bandwidth network on a chip (NOC) designated a “Hoplite NOC” and comprising FPGA-efficient, directional 2D routers designated “Hoplite routers”; particularly efficient FPGA soft processor cores; and an efficient, flexible, configurable architecture for composing processor cores, accelerator cores, and shared memories into clusters that communicate via means including direct coupling, cluster-shared memory, and message passing.
In this Autumn of Moore's Law, the computing industry is challenged to scale up throughput and reduce energy. This drives interest in FPGA accelerators, particularly in datacenter servers. For example, the Microsoft Catapult system uses FPGA acceleration at datacenter scale to double throughput or cut latency of Bing query document ranking. [3]
As computers, FPGAs offer parallelism, specialization, and connectivity to modern interfaces including 10-100 Gb/s Ethernet and many DRAM channels including High Bandwidth Memory (HBM). Compared to general purpose CPUs, FPGA accelerators can achieve higher throughput, lower latency, and lower energy per operation.
There are at least two big challenges to development of an FPGA accelerator. The first is software: it is expensive to move an application into hardware, and to maintain it as code changes. Rewriting C++ code in Register Transfer Language (RTL) is painful. High level synthesis maps a C function to gates, but does not help compose modules into a system, nor interface the system to the host. OpenCL-to-FPGA tools are a step ahead. With OpenCL developers have a software platform that abstracts away low level FPGA concerns. But “OpenCL to FPGA” is no panacea. Much important software is not and cannot be coded in OpenCL; the resulting accelerator is specialized to particular kernel(s); and following a simple edit to the OpenCL program, it may take several hours to re-implement the FPGA through the FPGA synthesize, place, and route tool chain.
To address the diversity of workloads, and for faster design turns, more of a workload might be run directly as software, on processors in the FPGA fabric. Soft processors may also be very tightly coupled to accelerators, with very low latency communications between the processor and the accelerator function core. But outperforming a full custom CPU can require many energy-efficient, FPGA-efficient soft processors working in tandem with workload accelerator cores.
The second challenge is implementation of the accelerator SOC hardware. The SOC consists of dozens of compute and accelerator cores, interconnected to each other and to extreme bandwidth interface cores e.g. PCI Express, 100G Ethernet, and, in the coming HBM era, eight or more DRAM channels. Accordingly, an embodiment of a practical, scalable system should provide sufficient interconnect connectivity and bandwidth to interconnect the many compute and interface cores at full bandwidth (typically 50-150 Gb/s per client core).
Actual acceleration of a software workload, i.e. running it faster or with greater aggregate throughput than is possible on a general purpose ASIC or full-custom CPU, motivates an FPGA-efficient soft processor that implements a standard instruction set architecture (ISA) for which a diverse body of software tools, libraries, and applications exists. The RISC-V ISA is a good choice. It is an open ISA; it is modern; extensible; designed for a spectrum of use cases; and it has a comprehensive infrastructure of specifications, test suites, compilers, tools, simulators, libraries, operating systems, and processor and interface intellectual property (IP) cores. Its core ISA, RV32I, is a simple 32-bit integer RISC.
The present disclosure describes an FPGA-efficient implementation of the RISC-V RV32I instruction set architecture, called “GRVI”. GRVI is an austere soft processor core that focuses on using as few hardware resources as possible, which enables more cores per die, which enables more compute and memory parallelism per integrated circuit (IC).
The design goal of the GRVI core was therefore to maximize millions of instructions per second per LUT-area-consumed (MIPS/LUT). This is achieved by eliding inessential logic from each CPU core. In one embodiment, infrequently used resources, such as shifter, multiplier, and byte/halfword load/store, are cut from the CPU core. Instead, they are shared by two or more cores in the cluster, so that their overall amortized cost is reduced, and in one embodiment, at least halved.
In one embodiment, the GRVI soft processor's microarchitecture is as follows. It is a two- or three-stage pipeline (optional instruction fetch; decode; execute) with a 2R/1W register file; two sets of operand multiplexers (operand selection and result forwarding) and registers; an arithmetic logic unit (ALU); a dedicated comparator for conditional branches and SLT (set less than); a program counter (PC) unit for I-fetch, jumps, and branches; and a result multiplexer to select a result from the ALU, return address, load data, optional shift and/or multiply.
In one embodiment, for GRVI, each LUT in the datapath was explicitly technology mapped (structurally instantiated) into FPGA 6-LUTs, and each LUT in the synthesized control unit was scrutinized. By careful technology mapping, including use of carry logic in the ALU, PC unit, and comparator, the core area and clock period may be significantly reduced.
GRVI is small and fast. In one embodiment, the datapath uses 250 LUTs and the core overall uses 320 LUTs, and it runs at up to 375 MHz in a Xilinx Kintex UltraScale (−2) FPGA. Its CPI (cycles per instruction) is approximately 1.3 (2 pipeline stage configuration) or 1.6 (3 pipeline stage configuration). Thus in this embodiment the efficiency figure of merit for the core is approximately 0.7 MIPS/LUT.
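By way of non-limiting illustration, the figure of merit above can be checked with simple arithmetic. The following Python sketch assumes, for illustration only, that the 375 MHz peak clock applies to both pipeline configurations and that MIPS is the clock rate divided by CPI; the function name is hypothetical.

```python
def mips_per_lut(clock_mhz: float, cpi: float, luts: int) -> float:
    """Millions of instructions per second, per LUT of core area."""
    return (clock_mhz / cpi) / luts

# Figures quoted in the text: 320 LUTs per core, up to 375 MHz.
# Assumption: the same peak clock is used for both pipeline depths.
two_stage = mips_per_lut(375.0, 1.3, 320)
three_stage = mips_per_lut(375.0, 1.6, 320)

print(f"2-stage: {two_stage:.2f} MIPS/LUT")
print(f"3-stage: {three_stage:.2f} MIPS/LUT")
```

Under these assumptions the three-stage configuration comes out near 0.73 MIPS/LUT, in line with the approximate figure of 0.7 MIPS/LUT stated above.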
As a GRVI processor core (also herein called variously “processing core” or simply “PE” for processing element) is relatively compact, it is possible to implement many PEs per FPGA (750 in one embodiment, in a 240,000 LUT Xilinx Kintex UltraScale KU040). But besides PEs, a practical computing system also needs memories and interconnects. A KU040 has 600 dual-ported 1K×36 BRAMs (block static random access memories), or one per 400 LUTs. How might all these cores and memories be organized into a useful, fast, easily programmed multiprocessor? It depends upon workloads and their parallel programming models. The present disclosure and embodiments, without limitation, particularly target data parallel, task parallel, and process network parallel programs (SPMD (single program, multiple data) or MIMD (multi-instruction-stream, multiple data)) with relatively small compute kernels.
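As a non-limiting worked example, the resource ratios quoted above follow directly from the device totals. The sketch below uses only figures stated in the text; the constant and function names are hypothetical.

```python
DEVICE_LUTS = 240_000   # KU040 6-LUTs, from the text
DEVICE_BRAMS = 600      # KU040 block RAMs, from the text
CORE_LUTS = 320         # GRVI core size, from the text

def max_cores(device_luts: int, core_luts: int) -> int:
    """Upper bound on core count if every LUT were spent on cores."""
    return device_luts // core_luts

def luts_per_bram(device_luts: int, device_brams: int) -> float:
    """Average number of LUTs available alongside each BRAM."""
    return device_luts / device_brams

print(max_cores(DEVICE_LUTS, CORE_LUTS))         # 750
print(luts_per_bram(DEVICE_LUTS, DEVICE_BRAMS))  # 400.0
```

The 750-core figure consumes every LUT in the device, which is why a configuration that also provides memories and interconnect uses fewer cores.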
For system-wide data memory, it is expensive (inefficient in terms of hardware resources required) to build fast cache coherent shared memory for hundreds of cores. Also, caches consume resources better spent on computation. Thus in a preferred embodiment data caches are not required.
Another embodiment employs an uncached global shared memory design. Here BRAMs are grouped into ‘memory segments’ distributed about the FPGA; any PE or accelerator at any site on the FPGA may issue remote store and load requests, and load responses, which traverse an interconnect such as a NOC to and from the addressed memory segment. This is straightforward to build and program, but if the PE is not memory latency tolerant, a non-local load instruction might stall the PE for 10-20 cycles or more as the load request and response traverse the interconnect and access the memory block. Thus in such embodiments, shared memory intensive workloads may execute more slowly than possible in other embodiments.
An embodiment, herein called a “Phalanx” architecture (so named for its resemblance to disciplined, cooperating arrays of troops in an ancient Greek military unit), partitions FPGA resources into small clusters of processors, accelerators, and a cluster-shared memory (“CRAM”), typically of 4 KB to 1 MB in size. Within a cluster, CRAM accesses by processor cores or accelerator cores have fixed low latency of a few cycles, and, assuming a workload's data can be subdivided into CRAM-sized working sets, memory intensive workloads may execute, in aggregate, relatively quickly.
In an embodiment targeting the 4 KB BRAMs of a Xilinx Kintex UltraScale KU040 device, Table 1 lists some CRAM configuration embodiments. A particularly effective embodiment uses the last configuration row in the table, in boldface. In this embodiment, the device is configured as 50 clusters, each cluster with 8 GRVI soft processor cores, pairwise-sharing 4 KB instruction RAMs (“IRAMs”), and together sharing a 32 KB cluster RAM.
In an embodiment targeting the 4 KB BRAMs and larger 32 KB URAMs (“UltraRAMs”) of a Xilinx Virtex UltraScale+ VU9P device, Table 2 lists some CRAM configuration embodiments. A particularly effective embodiment for that device uses the last configuration row in the table, in boldface. (Note the VU9P FPGA provides a total of 1.2M LUTs, 2160 BRAMs, 960 URAMs.) In this embodiment, the device is configured as 210 clusters, each cluster with 8 GRVI soft processor cores, pairwise-sharing 8 KB IRAMs, and together sharing a 128 KB cluster RAM.
Table 2, final configuration row (boldface): 4 | 4800 | 8 | 8 KB | 128 KB | 210
In some embodiments, the number of BRAMs and URAMs per cluster determines the number of LUTs that a cluster including those BRAMs/URAMs might use. In a KU040, twelve BRAMs correspond to 4800 6-LUTs. In an embodiment summarized in Table 1, eight PEs share 12 BRAMs. Four BRAMs are used as small 4 KB kernel program instruction memories (IRAMs). Each pair of processors shares one IRAM. The other eight BRAMs form a 32 KB cluster shared memory (CRAM). By clustering pairs of 4 KB BRAMs together into four logical banks, and configuring the (inherently dual port) 4 KB BRAMs, each with one 16-bit-wide port and one 32-bit-wide port, a 4-way banked interleaved memory with a total of twelve ports is achieved. Four 32-bit-wide ports provide a 4-way banked interleaved memory for PEs. Each cycle, up to four accesses may be made on the four ports. The eight PEs connect to the CRAM via four 2:1 concentrators and a 4×4 crossbar. (This advantageous arrangement requires fewer than half of the LUT resources of a full 32-bit-wide 8×8 crossbar. See
In some embodiments, the remaining eight ports provide an 8-way banked interleaved memory for accelerator(s), and also form a single 256-bit wide port to load and send, or to receive and store, 32 byte messages, per cycle, to/from any NOC destination, via the cluster's Hoplite router.
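By way of non-limiting illustration, the port accounting and the PE-side arbitration path described above (four 2:1 concentrators followed by a 4×4 crossbar) may be modeled as follows. The word-interleaved bank mapping and the fixed-priority arbitration are assumptions made for this sketch only; the disclosure does not specify either, and all names are hypothetical.

```python
from typing import List, Optional

BRAMS = 8        # 4 KB BRAMs forming the 32 KB CRAM
BANKS = 4        # PE-side banks, each built from a pair of BRAMs
WORD_BYTES = 4   # 32-bit accesses

def port_summary() -> dict:
    """Count ports: one 32-bit PE port per bank, one 32-bit wide-side port per BRAM."""
    pe_ports = BANKS       # two 16-bit ports of a BRAM pair act as one 32-bit port
    wide_ports = BRAMS     # the 32-bit port of each BRAM
    return {"pe_ports": pe_ports, "wide_ports": wide_ports,
            "total_ports": pe_ports + wide_ports, "wide_bits": wide_ports * 32}

def bank_of(byte_addr: int) -> int:
    """Assumed mapping: consecutive 32-bit words rotate through the banks."""
    return (byte_addr // WORD_BYTES) % BANKS

def grant(requests: List[Optional[int]]) -> List[int]:
    """Return indices of PEs whose access proceeds this cycle (lowest index wins)."""
    forwarded = []
    # Stage 1: each 2:1 concentrator forwards at most one request per PE pair.
    for pair in range(0, len(requests), 2):
        for pe in (pair, pair + 1):
            if requests[pe] is not None:
                forwarded.append(pe)
                break
    # Stage 2: the 4x4 crossbar grants at most one access per bank.
    busy, granted = set(), []
    for pe in forwarded:
        bank = bank_of(requests[pe])
        if bank not in busy:
            busy.add(bank)
            granted.append(pe)
    return granted

# Eight PEs; None means the PE issues no memory access this cycle.
print(port_summary())
print(grant([0x00, 0x04, 0x08, None, 0x10, None, 0x0C, 0x1C]))  # [0, 2, 6]
```

In this example PE 1 and PE 7 lose at their concentrators and PE 4 loses a bank conflict with PE 0; in the described design such losers would simply retry in a later cycle.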
To send a message, one or more PEs prepare a message buffer in CRAM. In some embodiments, the message buffer is a contiguous 32 B region of the CRAM memory. In some embodiments the message buffer address is aligned to a multiple of 32 bytes, i.e. it is 32 B-aligned. Then one PE stores the system-wide address, also known as the Phalanx Address (PA), of the message destination to the memory-mapped I/O region of the cluster's NOC interface. The cluster's NOC interface receives the request, atomically loads a 32 B message data payload from CRAM, formats it as a NOC message, and sends it via its message-output port to the message-input port of the cluster's router, into the interconnect network, and ultimately to some client of the NOC identified by the destination address of the message. The PA of the message destination encodes the NOC address (x,y) of the destination, as well as the local address within the destination client core (which may be another compute cluster). If the destination is a compute cluster, then the incoming message is subsequently written into that cluster's CRAM and/or is received by the accelerator(s). Note that this embodiment's advantageous arrangement of the second set of CRAM ports, with a total of 8×32=256 bits of memory ports directly coupled to the NOC router input, together with the use of CRAM-memory-buffered software message sends and of an ultra-wide NOC router and NOC, permits unusually high bandwidth message send/receive: a single 32-bit PE can send a 32 byte message from its cluster, out into the NOC, at a peak rate of one send per cycle, and a cluster can receive one such 32 byte message every cycle.
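The send sequence above may be sketched, without limitation, as a small behavioral model in Python. The bit layout of the Phalanx Address used here (x in bits 31:28, y in bits 27:24, local address in bits 23:0) is purely hypothetical, as are the class and method names; the disclosure states only that the PA encodes the (x,y) NOC address and a local address. How the source buffer address reaches the NOC interface is likewise not specified, so the model passes it as an argument.

```python
MSG_BYTES = 32  # message payload size, from the text

def make_pa(x: int, y: int, local_addr: int) -> int:
    """Pack a hypothetical 32-bit Phalanx Address: x[31:28], y[27:24], local[23:0]."""
    if not (0 <= x < 16 and 0 <= y < 16 and 0 <= local_addr < (1 << 24)):
        raise ValueError("field out of range for this illustrative layout")
    return (x << 28) | (y << 24) | local_addr

def split_pa(pa: int):
    """Unpack the hypothetical PA into (x, y, local_addr)."""
    return (pa >> 28) & 0xF, (pa >> 24) & 0xF, pa & 0xFFFFFF

class Cluster:
    def __init__(self, x: int, y: int, cram_bytes: int = 32 * 1024):
        self.xy = (x, y)
        self.cram = bytearray(cram_bytes)

    def send(self, buf_addr: int, dest_pa: int):
        """Model the NOC interface: load an aligned 32 B payload and emit a message."""
        if buf_addr % MSG_BYTES:
            raise ValueError("message buffer must be 32 B-aligned")
        payload = bytes(self.cram[buf_addr:buf_addr + MSG_BYTES])
        return (dest_pa, payload)

    def receive(self, message) -> None:
        """Model message arrival: write the payload at the PA's local address."""
        dest_pa, payload = message
        x, y, local = split_pa(dest_pa)
        if (x, y) != self.xy:
            raise ValueError("message delivered to the wrong cluster")
        self.cram[local:local + MSG_BYTES] = payload

# Usage: cluster (0,0) sends one 32 B message to address 0x100 of cluster (3,2).
src, dst = Cluster(0, 0), Cluster(3, 2)
src.cram[64:96] = bytes(range(32))
dst.receive(src.send(64, make_pa(3, 2, 0x100)))
print(dst.cram[0x100:0x120] == bytes(range(32)))  # True
```

The model omits the network itself; it shows only that a single store of the destination address is enough to move a whole 32 B buffer from one CRAM to another.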
In some embodiments, this message send mechanism also enables fast local memcpy and memset. Aligned data may be copied at 32 B per two cycles, by sending a series of 32 B messages from a source address in a cluster RAM, via its router, back to a destination address in the same cluster RAM; that is, this procedure allows a cluster circuit to “send to itself”.
In some embodiments, a cluster circuit is configured with one or more accelerator cores (also called “accelerators”). An accelerator core is typically a hardwired logic circuit, or a design-time or run-time configurable logic circuit, which, unlike a processor core, is not a general purpose, instruction-executing circuit. Rather, in some embodiments, the logic circuit implemented by an accelerator core may be specialized to perform, in fixed logic, some computation specific to one or more workloads.
In some embodiments wherein accelerator cores are implemented in an FPGA, the FPGA may be configured with a particular one or more accelerators optimized to speed up one or more expected workloads that are to be executed by the FPGA. In some embodiments accelerator cores communicate with the PEs via the CRAM cluster shared memory, or via direct coupling to a PE's microarchitectural ALU output, store-data, and load-data ports. Accelerators may also use a cluster router to send/receive messages to/from cluster RAMs, to/from other accelerators, or to/from memory or I/O controllers.
In some embodiments a cluster sends or receives a message in order to, without limitation, store or load a 32 B message payload to DRAM, to send/receive an Ethernet packet (as a series of messages) to/from an Ethernet NIC (network interface controller), and/or to send/receive data to/from AXI4 Stream endpoints.
In some embodiments, a cluster design includes a floorplanned FPGA layout of a cluster of 8 GRVI PEs, 12 BRAMs (4 IRAMs, 1 CRAM), 0 accelerators, local interconnect, Hoplite NOC interface, and Hoplite NOC router. In some embodiments, at design time, a cluster may be configured with more/fewer PEs and more or less IRAM and CRAM, to right-size resources to workloads.
In some embodiments, as with the GRVI soft processor core, the cluster ‘uncore’ (the logic circuits of the cluster, excluding the soft processor cores) is implemented with care to conserve LUTs. In some embodiments there are no FIFO (first-in-first-out) buffers or elastic buffers in the design. This reduces the LUT overhead of message input/output buffering to zero. Instead, NOC ingress flow control of message sends is manifest as wait states (pipeline holds) in the PE(s) attempting to send messages. Back pressure from the NOC, through the arbitration network, to each core's pipeline clock enable, may be the critical path in the design, and in this embodiment it limits the maximum clock frequency to about 300 MHz (small NOCs) and 250 MHz (die spanning SOCs).
Some embodiments use one Hoplite router per cluster, and the routers are together composed into a Hoplite NOC. Hoplite is a configurable directional 2D torus router that efficiently implements high bandwidth NOCs on FPGAs. An embodiment of a Hoplite router has a configurable routing function and a switch with three message inputs (XI, YI, I (i.e. from a client)) and two outputs (X, Y). At least one of the output message ports serves as the client output. (From the client's perspective this is the message-input bus.) Routers are composed on unidirectional X and Y rings to form a 2D torus network . . . .
A Hoplite router is simple, frugal, wide, and fast. In contrast with prior work, Hoplite routers use unidirectional, not bidirectional links; no buffers; no virtual channels; local flow control (by default); atomic message send/receive (no message segmentation or reassembly); client outputs that share NOC links; and are configurable, e.g. ultra-wide links, workload optimized routing, multicast, in-order delivery, client I/O specialization, link energy optimization, link pipelining, and floorplanning.
In some embodiments, a Hoplite router is an austere bufferless deflecting 2D torus router. To conserve LUTs, the use of a directional torus reduces a router's 5×5 crossbar to 3×3. The client output message port is infrequently used and inessential, and may be elided by reusing an inter-router link as a client output. This further simplifies the switch to 3×2. Since there are no buffers, when and if output port contention occurs, the router deflects a message to a second port. It will loop around its ring and try again later.
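As a non-limiting illustration of bufferless deflection in a 3×2 switch, the following Python sketch models one routing decision per cycle. It is a simplified behavioral model under stated assumptions, not the disclosed circuit: it assumes dimension-order routing (X ring first, then Y ring), a fixed input priority of YI over XI over the client input I, and that the client output is shared with the Y output. The names `Msg` and `route` are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple

class Msg(NamedTuple):
    x: int          # destination column
    y: int          # destination row
    data: object

def route(here: Tuple[int, int], yi: Optional[Msg], xi: Optional[Msg],
          client: Optional[Msg]):
    """Return (x_out, y_out, client_accepted, delivered_here) for one cycle."""
    hx, hy = here
    x_out: Optional[Msg] = None
    # Assumption: Y-ring traffic has top priority and always keeps the Y output.
    y_out: Optional[Msg] = yi
    if xi is not None:
        if xi.x == hx and y_out is None:
            y_out = xi      # reached its column: turn onto the Y ring
        else:
            x_out = xi      # keep travelling in X, or deflect around the X ring
    accepted = False
    if client is not None:  # inject only if the needed output is idle
        if client.x == hx:
            if y_out is None:
                y_out, accepted = client, True
        elif x_out is None:
            x_out, accepted = client, True
    # The shared Y output doubles as the client output.
    delivered = y_out is not None and (y_out.x, y_out.y) == (hx, hy)
    return x_out, y_out, accepted, delivered

# Contention: YI holds the Y output, so XI (which wants to turn) is deflected
# onto X, and the client (which wants X) must wait.
print(route((1, 1), Msg(1, 3, "a"), Msg(1, 2, "b"), Msg(2, 0, "c")))
```

Because there are only two outputs and at most two in-flight inputs, every in-flight message always leaves on some output, which is why no buffers are needed; only the client input can be refused.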
In some embodiments, a one-bit slice of a 3×2 switch and its registers may be technology mapped into a fracturable Xilinx 6-LUT or Altera ALM, with a one wire+LUT+FF delay critical path through a router. For die-spanning NOCs, inter-router wire delay is typically 90% of the clock period. In some embodiments, this can be reduced by using pipeline registers in the inter-router links. In some embodiments, Intel Stratix 10 HyperFlex interconnect pipeline flip-flops, not logic cluster flip-flops, implement NOC ring link pipeline registers, enabling very high frequency operation.
In some embodiments a KU040 floorplanned die-spanning 6×4 Hoplite NOC with 256-bit message payloads runs at 400 MHz and uses <3% of LUTs of the device. In some embodiments, the Hoplite NOC interconnect torus is not folded spatially, and employs extra pipeline registers in the Y rings and X rings for signals that may need to cross the full length or breadth of the die (or the multi-chip die in the case of 2.5D stacked-silicon-interconnect multi-die FPGAs). In some embodiments, link bandwidth is 100 Gb/s and the Hoplite NOC interconnect bisection bandwidth is 800 Gb/s. In some embodiments average latency from anywhere on the chip to anywhere else on the chip is about 7 cycles/17.5 ns assuming no message deflection.
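By way of non-limiting illustration, the link bandwidth and latency figures above follow from the payload width and clock frequency. The sketch below counts only the 256-bit payload toward bandwidth, which is an assumption; the function names are hypothetical.

```python
def link_gbps(payload_bits: int, clock_mhz: float) -> float:
    """Payload bandwidth of one link in Gb/s (one message per cycle)."""
    return payload_bits * clock_mhz / 1000.0

def latency_ns(cycles: int, clock_mhz: float) -> float:
    """Time taken by a given number of clock cycles, in nanoseconds."""
    return cycles * 1000.0 / clock_mhz

print(link_gbps(256, 400))   # 102.4, i.e. about 100 Gb/s per link
print(latency_ns(7, 400))    # 17.5 ns for an undeflected 7-cycle traversal
```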
Compared to other FPGA-optimized buffered virtual channel (VC) routers [5], a Hoplite NOC has an area×delay product that is orders of magnitude better. (Torus16, a 4×4 torus with 64-bit flits and 2 virtual channels, uses ~38,000 LUTs and runs at 91 MHz. In an embodiment, a 4×4 Hoplite NOC of 64-bit messages uses 1230 LUTs and runs at 333-500 MHz.) In some embodiments it is cheaper to build two Hoplite NOCs than one 2-virtual-channel NOC!
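As a non-limiting worked example, the area×delay comparison can be computed from the figures quoted above. The sketch takes the clock period as the delay term, which is an assumption about how the product is defined; the names are hypothetical.

```python
def area_delay(luts: int, clock_mhz: float) -> float:
    """Area x delay product in LUT-nanoseconds, using the clock period as delay."""
    return luts * (1000.0 / clock_mhz)

torus16 = area_delay(38_000, 91)        # buffered 2-VC router network
hoplite_slow = area_delay(1230, 333)    # Hoplite at the low end of its range
hoplite_fast = area_delay(1230, 500)    # Hoplite at the high end of its range

print(f"advantage at 333 MHz: {torus16 / hoplite_slow:.0f}x")
print(f"advantage at 500 MHz: {torus16 / hoplite_fast:.0f}x")
print(f"LUT ratio alone:      {38_000 / 1230:.1f}x")
```

Under this definition the advantage is roughly 110× to 170×, and the area ratio alone is about 31×, which is consistent with the remark that two Hoplite NOCs cost less than one 2-virtual-channel NOC.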
The advantageous area efficiency and design of an embodiment of a Hoplite router and of an embodiment of a Hoplite NOC torus including such routers, enables high performance interconnection across the FPGA die of diverse client cores and external interface cores, and simplifies chip floorplanning and timing closure, since as long as a core can connect to some nearby router, and tolerate a few cycles of NOC latency, its particular location on the FPGA (its floorplan) does not matter very much relative to operational speed and latency.
Listing 1 is a listing of Verilog RTL that instantiates an exemplary configurable GRVI Phalanx parallel computer SOC with dimension parameters NX and NY, i.e. to instantiate the NOC and an NX×NY array of clusters and interconnect NOC routers' inputs/outputs to each cluster. (This exemplary code employs XY etc. macros to mitigate Verilog's lack of 2D array ports.) A SOC/NOC floorplan generator (not shown) produces an FPGA implementation constraints file to floorplan the SOC/NOC into a die-spanning 10×5 array of tiles.
In an embodiment, the GRVI Phalanx design tools and RTL source code are extensively parameterized, portable, and easily retargeted to different FPGA vendors, families, and specific devices.
In an embodiment, a 32-PE SOC configuration (NX=2 × NY=2 × NPE=8) of a Digilent Arty FPGA (a small Xilinx XC7A35T) achieves a clock frequency of 150 MHz and a Hoplite NOC link bandwidth of over 40 Gb/s.
An embodiment of the disclosed parallel computer, with its many clusters of soft processor cores, accelerator cores, cluster shared memories, and message passing mechanisms, and with its ready composability between processors and accelerators within and amongst clusters, provides a flexible toolkit of compute, memory, and communications capabilities that makes it easier to develop and maintain an FPGA accelerator for a parallel software workload. Some workloads will fit its mold, especially highly parallel SPMD or MIMD code with small kernels, local shared memory, and global message passing. Here, without limitation, are some parallel models that map well to the disclosed parallel computer:
In an embodiment the disclosed parallel computer may be implemented in an FPGA, so these and other parallel models may be further accelerated via custom soft processor and cluster function units; custom memories and interconnects; and custom standalone accelerator cores on cluster RAM or directly connected on the NOC.
An embodiment of the system in a Xilinx Kintex UltraScale KU040 device comprises 400 FPGA-efficient RISC-V instruction set architecture (ISA) soft processors, designated “GRVI” (Gray Research RISC-V-I), arranged into a 10×5 torus of clusters, each cluster comprising a Hoplite router interface, 8 GRVI processing elements, a multiport, interleaved 32 KB cluster data RAM, and one or a plurality of accelerator cores. The system achieves a peak aggregate compute rate of 400×333 MHz×1 instruction/cycle=133 billion instructions per second. Each cluster can send or receive a 32 B (i.e. 256-bit) message into/from the NOC each cycle. Each of the 10×5 clusters has a Hoplite router. The resulting Hoplite NOC is configured with 300-bit links sufficient to carry a 256-bit data payload, plus address information and other data, each clock cycle. The aggregate memory bandwidth of the processors into the cluster RAM (CRAM) is 4 ports×50 CRAMs×4 B/cycle×333 MHz=266 GB/s. The aggregate memory bandwidth of the NOC and any CRAM-attached accelerators into the CRAM memories is 50 CRAMs×32 B/cycle×333 MHz=533 GB/s.
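By way of non-limiting illustration, the aggregate figures above can be reproduced as follows. All inputs are taken from the text; the variable names are hypothetical.

```python
CLUSTERS = 50            # 10 x 5 torus of clusters
PES_PER_CLUSTER = 8
CLOCK_HZ = 333e6
PE_PORTS = 4             # 32-bit PE-side CRAM ports per cluster
PE_PORT_BYTES = 4
WIDE_PORT_BYTES = 32     # 256-bit NOC/accelerator-side CRAM port per cluster

# Peak compute rate at one instruction per cycle per PE.
peak_ips = CLUSTERS * PES_PER_CLUSTER * CLOCK_HZ
# Aggregate PE-side and NOC/accelerator-side CRAM bandwidth in bytes per second.
pe_cram_bw = PE_PORTS * CLUSTERS * PE_PORT_BYTES * CLOCK_HZ
noc_cram_bw = CLUSTERS * WIDE_PORT_BYTES * CLOCK_HZ

print(f"{peak_ips / 1e9:.1f} billion instructions/s")  # 133.2
print(f"{pe_cram_bw / 1e9:.1f} GB/s PE-side")          # 266.4
print(f"{noc_cram_bw / 1e9:.1f} GB/s NOC-side")        # 532.8
```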
In an embodiment, a number of external interfaces, e.g. without limitation 10G/25G/40G/100G Ethernet, many channels of DRAM or many channels of High Bandwidth Memory, may be attached to the system. By virtue of the NOC interconnect, any client of the NOC may send messages, at data rates exceeding 100 Gb/s, to any other client of the NOC.
The many features of embodiments of the Hoplite router and NOC, and of other embodiments of the disclosure, include, without limitation:
To illustrate an example reduction to practice of an embodiment of the disclosed system,
A massively parallel computing system is disclosed. An example embodiment, which illustrates design and operation of the system, and which is not limiting, implements a massively parallel Ethernet router and packet processor.
In this example system, a cluster-core tile, implemented in an FPGA, uses four block RAMs for the instruction RAMs 222 and eight block RAMs for the cluster-data RAM 230. This configuration enables up to four independent 32-bit reads or writes into the CRAM 230 by the processors 220 and concurrently up to eight 32-bit reads or writes into the CRAM by the accelerators 250 (if any) or by the network interface 240.
In the exemplary computing system described herein, the system comprises ten rows×five columns=50 of such multiprocessor/accelerator cluster cores, or 50×8=400 processors 220 in total. A NOC (network on chip) is used to carry data as messages between clusters, between clusters and external-interface cores (for example to load or store to external DRAM), and directly between external-interface cores. In this example, NOC messages are approximately 300 bits wide, including 288 bits of data payload (32-bit address and 256-bit data field).
The cluster core 210 also comprises a Hoplite NOC router interface 240, which connects the cluster's CRAM memory banks to the cluster's Hoplite router input, so that a message data payload read from the cluster's CRAM via one or more of its many ports may be sent (output) to another client on the NOC via the message input port on the cluster's Hoplite router, or the data payload of a message received from another NOC client via the NOC via the cluster's Hoplite router may be written into the cluster's CRAM via one or more of its many ports. In this example, the processor cores 220 share access to the cluster's CRAM with each other, with zero or more accelerator cores 250, and with the Hoplite NOC interface. Accordingly, a message received from the NOC into the local memory may be directly accessed and processed by any (or many) of the cluster's processors, and conversely the cluster's processors may prepare a message in memory and then cause it to be sent from the cluster to other client cores of the NOC via the Hoplite router 200.
In the cluster arrangement of cores 210, CRAM 230, and network interface 240 described in conjunction with
In this example, a computing cluster 210 may further comprise zero, one, or more accelerator cores 250, coupled to the other components of the cluster in various ways. An accelerator 250 may use the cluster-local interconnect network to directly read or write one or more CRAM ports. An accelerator 250 may couple to a soft processor 220, and interact with software execution on that processor, in various ways, for example and without limitation, to access registers, receive data, provide data, determine conditional-branch outcomes, through interrupts, or through processor-status-word bits. An accelerator 250 may couple to the Hoplite router interface 240 to send or receive messages. Within a cluster 210, interconnection of the processor cores 220, accelerators 250, memories 222 and 230, and Hoplite NOC interface 240 makes it possible for the combination of these components to form a heterogeneous accelerated computing engine. Aspects of a workload that are best expressed as a software algorithm may be executed on one or more of the processor cores 220. Aspects that may be accelerated or made more energy efficient by expression in a dedicated logic circuit may be executed on one or more accelerator cores 250. The various components may share state, intermediate results, and messages through direct-communication links, through the cluster's shared memory 230, and via sending and receiving of messages. Across the many clusters including clusters 180 A-F of the SOC 102, different numbers and types of accelerator cores 250 may be configured. As an example, in a video special effects processing system, a first cluster 180 A (
Referring to
In an embodiment, each field of the message has a configurable bit width. Router build-time parameters MCAST, X_W, Y_W, and D_W select minimum bit widths for each field of a message and determine the overall message width MSG_W. In an embodiment, NOC links have a minimum bit width sufficient to transport a MSG_W-bit message from one router to the next router on the ring in one cycle.
Referring again to
Given an uncompressed packet in CRAM, malware-detection software executes on one or more of the cluster's soft processors 220 to scan the bytes of the message payload for particular byte sequences that exhibit characteristic signatures of specific malware programs or code strings. If potential malware is discovered, the packet is not to be retransmitted on some network port, but rather is saved to the system's DRAM memory 120 for subsequent ‘offline’ analysis.
Next, packet-routing software, run on one or more of the soft processors 220, consults tables to determine where to send the packet next. Certain fields of the packet, such as ‘time to live’, may be updated. If so configured, the packet may be recompressed by a compression routine running on one or more of the soft processors 220. Finally, the packet is segmented into one or more (exemplary) 32 byte NOC messages, and these messages are sent one by one through the cluster's Hoplite router 200, via the NOC, to the appropriate NIC client core 140. As these messages are received by the NIC via the NOC, they are reformatted within the NIC into an output packet, which the NIC transmits via its external network interface.
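The segmentation and reformatting steps above may be sketched, without limitation, as follows. Zero padding of the final message and conveying the packet length separately are assumptions of this sketch; the disclosure does not specify how a partial final message is handled. The function names are hypothetical.

```python
from typing import List

MSG_BYTES = 32  # exemplary NOC message payload size, from the text

def segment(packet: bytes) -> List[bytes]:
    """Split a packet into 32-byte payloads, zero-padding the last one (assumed)."""
    chunks = [packet[i:i + MSG_BYTES] for i in range(0, len(packet), MSG_BYTES)]
    if chunks and len(chunks[-1]) < MSG_BYTES:
        chunks[-1] = chunks[-1].ljust(MSG_BYTES, b"\x00")
    return chunks

def reassemble(messages: List[bytes], length: int) -> bytes:
    """Rebuild the packet at the receiver; the length is assumed known separately."""
    return b"".join(messages)[:length]

# Usage: a 70-byte packet becomes three 32-byte messages.
packet = bytes(range(70))
messages = segment(packet)
print(len(messages), all(len(m) == MSG_BYTES for m in messages))
print(reassemble(messages, len(packet)) == packet)
```

This sketch also assumes that the messages of one packet arrive in order; a receiver that cannot rely on ordering would need a sequence number or a destination address per message.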
In this example, the computations of decompression, malware detection, compression, and routing are performed in software, possibly in a parallel or pipelined fashion, by one or more soft processors 220 in one or more computing-cluster clients 210. In alternative embodiments, any or all of these steps may be performed in dedicated logic hardware by accelerator cores 250 in the cluster.
Whereas a soft processor 220 is a program-running, instruction-executing general purpose computing core, e.g. a microprocessor or microcontroller, in contrast, an accelerator core may be, without limitation, a fixed function datapath or function unit, or a datapath and finite state machine, or a configurable or semi-programmable datapath and finite state machine. In contrast to a processor core 220, which can run arbitrary software code, an accelerator core 250 is not usually able to run arbitrary software but rather has been specialized to implement a specific function or set of functions or restricted subcomputation as needed by a particular one or more application workloads. Accelerator cores 250 may interconnect to each other or to the other components of the cluster through means without limitation such as direct coupling, FIFOs, or by writing and reading data in the cluster's CRAM 230, and may interconnect to the diverse other components of system 102 by sending and receiving messages through router 200 into the NOC 150.
In an embodiment, packet processing for a given packet takes place in one computing-cluster client 210. In alternative embodiments, multiple compute-cluster clients 210 may cooperate to process packets in a parallel, distributed fashion. For example, specific clusters 210 (e.g. clusters 180 A-F) may specialize in decompression or compression, while others may specialize in malware detection. In this case, the packet messages might be sent from a NIC to a decompression cluster 210. After decompression, the decompression cluster 210 may send the decompressed packet (as one or more messages) on to a malware scanner cluster 210. There, if no malware is detected, the malware scanner may send the decompressed, scanned packet to a routing cluster 210. There, after determining the next destination for the packet, the routing cluster 210 may send the packet to a NIC client 140 for output. There, the NIC client 140 may transmit the packet to its external network interface. In this distributed packet-processing system, in an embodiment, a client may communicate with another client via some form of direct connection of signals, or, in an embodiment, a client may communicate with another client via messages transmitted via the NOC. In an embodiment, communications may be a mixture of direct signals and NOC messages.
An embodiment of this exemplary computing system may be implemented in an FPGA as follows. Once again, the following specific example should not be construed to be limiting, but rather to illustrate an advantageous application of an embodiment disclosed herein. The FPGA device is a Xilinx Kintex UltraScale KU040, which provides a total of 300 rows×100 columns of slices of eight 6-LUTs=240,000 6-LUTs, and 600 BRAMs (block RAMs) of 36 Kb each. This FPGA is configured to implement the exemplary computing device described above, with the following specific components and parameters. A Hoplite NOC configured for multicast DOR (dimension order) routing, with NY=10 rows by NX=5 columns of Hoplite routers and with w=256+32+8+4=300-bit wide links, forms the main NOC of the system. The FPGA is floor planned into 50 router+multiprocessor/accelerator clusters arranged as rectangular tiles, and arrayed in a 10×5 grid layout, with each tile spanning 240 rows by 20 columns=4800 6-LUTs and with 12 BRAMs. The FPGA resources of a tile are used to implement a cluster-client core 210 and the cluster's Hoplite router 200. The cluster 210 has a configurable number (zero, one, or a plurality) of soft processors 220. In this example, the soft processors 220 are in-order pipelined scalar RISC cores that implement the RISC-V RV32I instruction-set architecture. Each soft processor 220 consumes about 300 6-LUTs of programmable logic. Each cluster has eight processors 220. Each cluster also has four dual-ported 4 KB BRAMs that implement the instruction memories 222 for the eight soft processors 220. Each cluster 210 also has eight dual-ported 4 KB BRAMs that form the cluster data RAM 230. One set of eight ports on the BRAM array is arranged to implement four address-interleaved memory banks, to support up to four concurrent memory accesses into the four banks by the soft processors 220. 
The other set of eight ports, with input and output ports each being 32 bits wide, totaling 32 bits×8=256 bits, on the same BRAM array is available for use by accelerator cores 250 (if any) and is also connected to the cluster's Hoplite router input port 202 and the Hoplite router's Y output port 204. Router-client control signals 206 (corresponding to O_V and I_RDY of
A set of memory bank arbiters and multiplexers 224, 226 manages bank access to the BRAM array from the concurrent reads and writes from the eight processors 220.
In this exemplary system, software running on one or more soft processors 220 in a cluster 210 can initiate a message send of some bytes of local memory to a remote client across the NOC. In some embodiments, a special message-send instruction may be used. In another embodiment, a regular store instruction to a special I/O address corresponding to the cluster's NOC interface controller 240 initiates the message send. The store instruction provides a store address and a 32-bit store-data value. The NOC interface controller 240 interprets this as a message-send request, to load from local CRAM payload data of 1-32 bytes at the specified local “store” address, and to send that payload data to the destination client on the NOC, at a destination address within the destination client, indicated by the store's 32-bit data value.
Three examples illustrate a method of operation of the system of
1) To send a message to another processor 220 in another cluster 210, a processor 220 prepares the message bytes in its cluster's CRAM 230, then stores (sends) the message to the receiver/destination by means of executing a store instruction to a memory mapped I/O address interpreted as the cluster's NOC interface controller 240 and interpreted by NOC interface controller 240 as a signal to perform a message send. The 32-bit store-data value encodes (in specific bit positions) the (x,y) coordinates of the destination cluster's router 200, and also the address within the destination cluster's local memory array to receive the copy of the message. The cluster's NOC interface controller 240 reads up to 32 bytes from the cluster BRAM array, formats this into a NOC message, and sends it via the cluster's Hoplite router, across the NOC, to the specific cluster, which receives the message and writes the message payload data into its CRAM 230 at the local address specified in the message.
2) To store a block of 1-32 bytes of data to DRAM through a specific DRAM channel 144, perhaps in a conventional DRAM, perhaps in a segment of an HBM DRAM device, a processor first writes the data (to be written to DRAM) to the cluster's CRAM 230, then stores (sends) the message to the DRAM by means of executing a store instruction to a memory mapped I/O address interpreted as the cluster's NOC interface controller 240, once again interpreted as a signal to perform a message send. The provided 32-bit store-data value indicates a) that the store is destined for DRAM rather than the local cluster memory of some cluster, and b) the address within the DRAM array at which to receive the block of data. The NOC interface controller 240 reads the 1-32 bytes from the cluster's CRAM 230, formats this into a NOC message, and sends it via the cluster's Hoplite router 200 across the NOC to the specific DRAM channel controller 144, which receives the message, extracts the local (DRAM) address and payload data, and performs the store of the payload data to the specified DRAM address.
3) To perform a remote read of a block of 1-32 bytes of data, for example, from a DRAM channel 144, into 1-32 bytes of cluster local memory, a processor 220 prepares a load-request message, in CRAM, which specifies the address to read, and the local destination address of the data, and sends (by another memory mapped I/O store instruction to the NOC interface controller 240, signaling another message send) that message to the specific DRAM channel controller 144, over the NOC. Upon receipt by the DRAM channel controller 144, the latter performs the read request, reading the specified data from DRAM 120, then formatting a read-response message with a destination of the requesting cluster 210 and processor 220, and with the read-data bytes as its data payload. The DRAM channel controller 144 sends the read-response message via its Hoplite router 200 via the Hoplite NOC, back to the cluster 210 that issued the read, where the message payload (the read data) is written to the specified read address in the cluster's CRAM 230.
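By way of non-limiting illustration, the remote-read sequence of example 3 can be sketched as a small software model. This is a toy sketch under stated assumptions, not the disclosed hardware: the class names, the dictionary message fields (dram_addr, reply_x, reply_y, reply_addr, length), and the 64 KB DRAM size are hypothetical, and the response is addressed only to the requesting cluster, omitting the per-processor destination mentioned above.

```python
MSG_SIZE = 32          # bytes per NOC message payload, as in the text
CRAM_SIZE = 32 * 1024  # cluster RAM size in this example


class DramChannel:
    """Toy model of a DRAM channel controller client on the NOC."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def handle_load_request(self, request):
        # Read the requested block, then address the response to the
        # requesting cluster and the local address named in the request.
        start = request["dram_addr"]
        data = bytes(self.mem[start:start + request["length"]])
        return {
            "x": request["reply_x"],
            "y": request["reply_y"],
            "addr": request["reply_addr"],
            "data": data,
        }


class Cluster:
    """Toy model of a cluster's CRAM and the receive side of its NOC interface."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        self.cram = bytearray(CRAM_SIZE)

    def make_load_request(self, dram_addr, reply_addr, length=MSG_SIZE):
        if not 1 <= length <= MSG_SIZE:
            raise ValueError("a message carries 1-32 bytes")
        return {"dram_addr": dram_addr, "length": length,
                "reply_x": self.x, "reply_y": self.y,
                "reply_addr": reply_addr}

    def deliver(self, message):
        # Data addresses wrap modulo CRAM_SIZE, so 0x8040 lands at offset 0x40.
        offset = message["addr"] % CRAM_SIZE
        self.cram[offset:offset + len(message["data"])] = message["data"]


dram = DramChannel(64 * 1024)
dram.mem[0x1000:0x1020] = bytes(range(32))

cluster = Cluster(x=2, y=3)
request = cluster.make_load_request(dram_addr=0x1000, reply_addr=0x8040)
response = dram.handle_load_request(request)  # carried by the NOC in hardware
cluster.deliver(response)
```

In this model the two function calls stand in for the two NOC traversals (request and response); in the disclosed system each is a message routed by Hoplite routers 200.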
This exemplary parallel computing system is a high-performance FPGA system on a chip. Across all 5×10=50 clusters 210, 50×8=400 processor cores 220 operate with a total throughput of up to 400×333 MHz=133 billion operations per second. These processors can concurrently issue 50×4=200 memory accesses per clock cycle, or a total of 200×333 MHz=67 billion memory accesses per second, which, at 4 bytes per access, is a peak bandwidth of 267 GB/s. Each of the 50 clusters' memories 230 also has an accelerator/NOC port which can access 32 bytes/cycle/cluster for a peak accelerator/NOC memory bandwidth of 50×32 B/cycle=1.6 KB/cycle or 533 GB/s. The total local memory bandwidth of the machine is therefore 267+533=800 GB/s. Each link in the Hoplite NOC carries a 300-bit message, per cycle, at 333 MHz. Each message can carry a 256-bit data payload for a link payload bandwidth of 85 Gbps and a NOC bisection bandwidth of 10×85=850 Gbps.
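The peak figures above follow from simple arithmetic on the example's parameters, which the following sketch reproduces. It assumes a 3 ns clock period (333⅓ MHz) and 4-byte processor accesses, and it takes the count of ten links crossing the bisection from the text; the variable names are illustrative only. Because the text rounds the link bandwidth to 85 Gbps before multiplying, the unrounded bisection figure here comes out slightly higher than 850 Gbps.

```python
NX, NY = 5, 10              # Hoplite router columns and rows
PES_PER_CLUSTER = 8         # soft processors per cluster
BANKS_PER_CLUSTER = 4       # concurrent processor accesses per cluster per cycle
WORD_BYTES = 4              # assumed 32-bit processor accesses
NOC_PORT_BYTES = 32         # accelerator/NOC port width per cluster
PAYLOAD_BITS = 256          # data payload carried by each 300-bit message
BISECTION_LINKS = 10        # links crossing the bisection, per the text
CLOCK_HZ = 1e9 / 3          # 3 ns cycle, i.e. about 333 MHz

clusters = NX * NY
cores = clusters * PES_PER_CLUSTER
peak_ops = cores * CLOCK_HZ
accesses_per_cycle = clusters * BANKS_PER_CLUSTER
processor_bw = accesses_per_cycle * WORD_BYTES * CLOCK_HZ   # bytes/s
noc_port_bw = clusters * NOC_PORT_BYTES * CLOCK_HZ          # bytes/s
total_local_bw = processor_bw + noc_port_bw                 # bytes/s
link_payload_bw = PAYLOAD_BITS * CLOCK_HZ                   # bits/s
bisection_bw = BISECTION_LINKS * link_payload_bw            # bits/s

for name, value in [("Gops/s", peak_ops), ("processor GB/s", processor_bw),
                    ("NOC-port GB/s", noc_port_bw),
                    ("total GB/s", total_local_bw),
                    ("link Gbps", link_payload_bw),
                    ("bisection Gbps", bisection_bw)]:
    print(f"{name}: {value / 1e9:.0f}")
```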
The LUT area of a single Hoplite router 200 in this exemplary system is 300 6-LUTs for the router data path and approximately 10 LUTs for the router control/routing function. Thus the total area of this Hoplite NOC 200 is about 50×310=15,500 LUTs, or just 6% of the total device LUTs. In contrast the total area of the soft-processor cores 220 is 50×300×8=120,000 LUTs, or about half (50%) of the device LUTs, and the total area of the cluster local memory interconnect multiplexers and arbiters 224 and 226 is about 50×800=40,000 LUTs, or 17% of the device.
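The area budget above can be checked with a short calculation. The sketch below uses only the LUT counts quoted in this example (240,000 device LUTs, 310 LUTs per router, 300 LUTs per processor, about 800 LUTs of memory interconnect per cluster); the variable names are illustrative.

```python
DEVICE_LUTS = 300 * 100 * 8      # rows x columns x eight 6-LUTs per slice
CLUSTERS = 50
ROUTER_LUTS = 300 + 10           # data path plus control/routing function
PROCESSOR_LUTS = 300
PES_PER_CLUSTER = 8
INTERCONNECT_LUTS = 800          # arbiters and multiplexers 224, 226 per cluster

noc_luts = CLUSTERS * ROUTER_LUTS
processor_luts = CLUSTERS * PES_PER_CLUSTER * PROCESSOR_LUTS
interconnect_luts = CLUSTERS * INTERCONNECT_LUTS

for name, luts in [("NOC", noc_luts), ("processors", processor_luts),
                   ("memory interconnect", interconnect_luts)]:
    print(f"{name}: {luts} LUTs, {100 * luts / DEVICE_LUTS:.0f}% of device")
```

On these figures the three components together occupy 175,500 LUTs, roughly 73% of the device.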
As described earlier, in this continuing example system, packets are processed, one by one as they arrive at each NIC, by one or more clusters. In another embodiment, the array of 50 compute clusters 210 is treated as a “Gatling gun” in which each incoming packet is sent as a set of NOC messages to a different, idle cluster. In such a variation, clusters may be sent new packets to process in a strict round robin order, or packets may be sent to idle clusters even as other clusters take more time to process larger or more-complex packets. On a 25G (25 Gbps bandwidth) network, a 100 byte (800 bit) message may arrive at a NIC every (800 bits/25 e9 b/s)=32 ns. As each received packet is forwarded (as four 32-byte NOC messages) from a NIC to a specific cluster 210, that cluster, one of 50, works on that packet exclusively for up to 50 packet-arrival-intervals before it must finish up and prepare to receive its next packet. This gives a cluster packet-processing-time interval of 50×32 ns=1600 ns, or 1600 ns/3 ns/cycle=533 clock cycles. With eight soft processors 220, the cluster can devote 533 cycles×8 processors×up to 1 instruction/cycle, i.e., up to about 4200 instructions of processing to each packet. In contrast, a conventional FPGA system is unable to perform so much general purpose programmable computation on a packet in so little time. For applications beyond network-packet compression and malware detection, throughput can be further improved by adding dedicated accelerator-function core(s) 250 to the soft processors 220 or to the cluster 210.
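The per-packet instruction budget in this round-robin example can be reproduced as follows. The sketch works in integer picoseconds to avoid rounding surprises and assumes, as in the text, a 3 ns clock period and at most one instruction per processor per cycle; the names are illustrative.

```python
PACKET_BITS = 100 * 8        # 100-byte packet
LINE_RATE_GBPS = 25          # 25G network: 25 bits per nanosecond
CLUSTERS = 50
CYCLE_PS = 3000              # 3 ns clock period
PES_PER_CLUSTER = 8

arrival_interval_ps = PACKET_BITS * 1000 // LINE_RATE_GBPS   # 32 ns
cluster_window_ps = CLUSTERS * arrival_interval_ps           # 1600 ns
cycles_per_packet = cluster_window_ps // CYCLE_PS
instruction_budget = cycles_per_packet * PES_PER_CLUSTER

print(arrival_interval_ps // 1000, "ns between packets")
print(cycles_per_packet, "cycles per packet per cluster")
print(instruction_budget, "instructions per packet, at most")
```

The exact product is 4264 instructions, which the text rounds to about 4200.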
In addition to message-passing-based programming models, an embodiment of the system is also an efficient parallel computer to host data-parallel-programming models such as that of OpenCL. Each parallel kernel invocation may be scheduled to, or assigned to, one or more of the cluster circuits 210 in a system, wherein each thread in an OpenCL workgroup is mapped to one core 220 within a cluster. The classic OpenCL programming pattern of 1) reading data from an external memory into local/workgroup memory; then 2) processing it locally, in parallel, across a number of cores; then 3) writing output data back to external memory, maps well to the architecture described in conjunction with
In summary, in this example, a Hoplite NOC facilitates the implementation of a novel parallel computer by providing efficient computing cluster client cores 210 of multiple soft processors 220 and accelerators 250 composed along with the cluster's CRAM 230, and with efficient interconnection of its diverse clients—computing cluster cores, DRAM channel-interface cores, and network interface cores. The NOC makes it straightforward and efficient for computation to span compute clusters, which communicate by sending messages (ordinary or multicast messages). By efficiently carrying extreme bandwidth data traffic to any site in the FPGA, the NOC simplifies the physical layout (floor planning) of the system. Any client in the system, at any site in the FPGA, can communicate at high bandwidth with any NIC interface or with any DRAM channel interface. This capability may be particularly advantageous to fully utilize FPGAs that integrate HBM DRAMs and other die-stacked, high-bandwidth DRAM technologies. Such memories present eight or more DRAM channels, 128-bit wide data, at 1-2 Gbps (128-256 Gbps/channel). Hoplite NOC configurations, such as demonstrated in this exemplary computing system, efficiently enable a core, from anywhere on the FPGA die, to access any DRAM data on any DRAM channel, at full memory bandwidth. No available systems or networking technologies or architectures, implemented in an FPGA device, can provide this capability, with such software programmable flexibility, at such high data rates.
While enabled, and as often as every clock cycle, the routing circuit 350 examines the input messages 302, 304, and 306 if present, to determine which of the XI, YI, and I inputs should route to which X and Y outputs, and to determine the values of the validity outputs defined herein. In an embodiment, the routing circuit 350 also outputs router switch-control signals comprising X-multiplexer select 354 and Y-multiplexer select 352. In alternative embodiments, switch-control signals may comprise different signals including, without limitation, input- or output-register clock enables and switch-control signals to introduce or modify data in the output messages 310 and 312.
While enabled, and as often as every clock cycle, the switch circuit 330 determines the first- and second-dimension output-message values 310 and 312, on links X and Y, as a function of the input messages 302, 304, and 306 if present, and as a function of switch-control signals 352, 354 received from the routing circuit 350.
Still referring to
Referring to
To illustrate an example reduction to practice of an embodiment of the above-described system,
In an embodiment, the parallel computer is experienced, by parallel application software workloads running upon it, as a shared memory software thread plus a set of memory mapped I/O programming interfaces and abstractions. This section of the disclosure provides, without limitation, an exemplary set of programming interfaces to illustrate how software can control the machine and direct it to perform various disclosed operations, such as a processor in one cluster preparing and sending a message to another cluster's CRAM 230.
Exemplary machine parameters: In an embodiment,
In an embodiment, for a Xilinx KCU105 FPGA, NX=5 NY=10 NPEC=8 IRAM_SIZE=4K NBANKS=4 CRAM_SIZE=32K NPE=400.
In an embodiment, for a Xilinx XC7A35T FPGA, NX=2 NY=2 NPEC=8 IRAM_SIZE=4K NBANKS=4 CRAM_SIZE=32K NPE=32.
In an embodiment, for a Xilinx XCVU9P FPGA, NX=7 NY=30 NPEC=8 IRAM_SIZE=8K NBANKS=4 CRAM_SIZE=128K NPE=1680 (i.e. 7×30×8=1680 processor cores in all).
Addressing:
In an embodiment, all on-chip instruction and data RAM share portions of the same non-contiguous Phalanx address (PA) space. Within a cluster, a local address specifies a resource such as a CRAM address or an accelerator control register. At the Phalanx SOC scale, by contrast, Phalanx addresses identify where to send messages, i.e., the destination cluster and the local address within that cluster.
Within a cluster, a processor or accelerator core can directly read and write its own CRAM_SIZE cluster CRAM. In an embodiment where CRAM_SIZE is 32 KB, each cluster receives a 64 KB portion of PA space. Any cluster resources associated with cluster (x,y) are at PA 00xy0000-00xyFFFF (hexadecimal—herein the “0x” prefix denoting hexadecimal may be elided to avoid confusion with cluster coordinate (x,y)).
Instructions:
In an embodiment, within a cluster, from the perspective of one processor core, instructions live in an instruction RAM (IRAM) at local text address 0000. The linker links program .text to start at 0000. The boot (processor core reset) address is 0000. Each core only sees IRAM_SIZE of .text so addresses in this address space wrap modulo IRAM_SIZE. Instruction memory is not readable (only executable), and may only be written by sending messages (new instructions in message payload data) to the .text address. In an embodiment, the PA of (x,y).iram[z] is 00xyz000 for z in [0 . . . 3]. A PE must be held in reset while its IRAM is being updated. See also the cluster control register description, below.
IRAM initialization examples:
Data:
In an embodiment with CRAM_SIZE=32K, within a cluster, data lives in a shared cluster RAM (CRAM) starting at local data address 8000. All cores in a cluster share the same CRAM. The linker links data sections .data, .bss, etc. to start at 8000. Data address accesses (load/store) wrap modulo CRAM_SIZE. Byte/halfword/word loads and stores must be naturally aligned, and are atomic (do not tear). The RISC-V atomic instructions LR/SC (“load reserved” and “store conditional”) are implemented by the processors and enable robust implementation of thread-safe locks, semaphores, queues, etc.
CRAM addressing: the PA of cluster (x,y)'s CRAM is 00xy8000.
To send a message, i.e. to copy one MSG_SIZE-aligned MSG_SIZE block of CRAM at local address AAAA to another MSG_SIZE-aligned block of CRAM in cluster (x,y) at local address BBBB with AAAA and BBBB each in 8000-FFFF, issue a store instruction: sw 00xyBBBB,8000AAAA.
The memory mapped I/O cluster NOC interface controller address range is 0x80000000-0x8000FFFF and so this exemplary store is interpreted as a message send request. In response, the cluster's NOC interface fetches the 32 byte message data payload from address AAAA in the cluster's CRAM, formats it as a NOC message destined for the cluster (or other NOC client) at router (x,y) and local address at that cluster of BBBB, and sends the message into the NOC. Later it is delivered by the NOC, to the second cluster with router (x,y), and stored to the second cluster's CRAM at address BBBB.
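As a non-limiting illustration of the addressing just described, the following sketch composes a Phalanx address of the form 00xyaaaa and the address/data pair of the message-send store. It assumes that x and y each occupy one hexadecimal digit (bits 23:20 and 19:16), as the 00xy notation suggests, with MSG_SIZE=32 bytes; the function names are hypothetical and are not part of the disclosed interfaces.

```python
MSG_SIZE = 32                 # bytes per message
NOC_MMIO_BASE = 0x80000000    # start of the NOC interface MMIO range


def phalanx_address(x, y, local):
    """Compose PA = 00xyaaaa from cluster coordinates and a 16-bit local address."""
    if not (0 <= x <= 0xF and 0 <= y <= 0xF):
        raise ValueError("x and y must each fit in one hex digit")
    if not 0 <= local <= 0xFFFF:
        raise ValueError("local address must fit in 16 bits")
    return (x << 20) | (y << 16) | local


def check_cram_message_address(local):
    """Message buffers lie in CRAM (8000-FFFF) and are MSG_SIZE-aligned."""
    if not 0x8000 <= local <= 0xFFFF or local % MSG_SIZE != 0:
        raise ValueError(f"bad message buffer address {local:#06x}")
    return local


def message_send_store(src_local, dest_x, dest_y, dest_local):
    """Return (store address, store data) for the message-send store."""
    src = check_cram_message_address(src_local)
    dest = phalanx_address(dest_x, dest_y, check_cram_message_address(dest_local))
    return NOC_MMIO_BASE | src, dest


store_address, store_data = message_send_store(0x8020, 3, 7, 0x9000)
print(f"store data {store_data:08X} to address {store_address:08X}")
```

Under the same assumed layout, phalanx_address(x, y, 0x4000) yields the cluster control register address 00xy4000 and phalanx_address(x, y, z << 12) yields the address of iram[z].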
Cluster Control:
In an embodiment, a cluster control register (“CCR”) manages the operation of the cluster. The PA of the CCR of cluster (x,y) is 00xy4000:
To write to a cluster (x,y)'s CCR, first store the new CCR data to local RAM at a MSG_SIZE-aligned address A, then issue sw 00xy4000,80000000(A).
In an embodiment, when a GRVI receives an interrupt via the CCR interrupt mechanism, it performs an interrupt sequence. This is defined as interrupt::=jal x30,0x10(x0), a RISC-V instruction that transfers control to address 00000010 and saves the interrupt return address in dedicated interrupt return address register x30.
In an embodiment, a PE must be held in reset while its IRAM is written.
Memory Mapped I/O:
In an embodiment, I/O addresses start at 0x80000000. The following memory address ranges represent memory mapped I/O resources:
Processor ID:
In an embodiment, each PE carries a read-only 32-bit extended processor ID register called an XID, of the format 00xyziii (8 hexadecimal digits):
For example, a system with NX=1,NY=3,NPEC=2 has 6 PEs with 6 XIDs:
In an embodiment, each PE's XID is obtained from its RISC-V register x31.
Phalanx Configuration:
PHXID (Phalanx ID). In an embodiment, each Phalanx has a memory mapped I/O PHXID, of the format Mmxyziii (8 hexadecimal digits) that reports the Phalanx system build parameters:
Using These Exemplary Interfaces:
With these interfaces disclosed, it is now apparent how a software workload or subroutine, loaded into an IRAM, performs its part of the overall parallel program that spans the whole parallel computer. In a non-limiting example, each processor core will:
An Example Parallel Program Using these Interfaces:
This section of the disclosure provides, without limitation, an exemplary RISC-V assembler+C program to further illustrate how a parallel computation may be implemented in an embodiment of the disclosed parallel computer. Once a processor has booted and has performed C runtime initialization:
The following three RISC-V assembly code and C code listings provide an exemplary implementation of this message-passing orchestrated parallel program.
In this example, pa.S implements the startup, C runtime initialization code, and Phalanx addressing helper code, in assembly:
In this continuing example, run.c implements the administrator process and worker process logic. Execution begins with the ‘run’ function which determines whether this process should run as administrator or worker, depending upon its processor ID.
In this continuing example, thoth.c is a library which implements a simple Thoth [4] message passing library, with functions send/receive/receiveAll/reply:
Method to Send a Wide Message, Atomically, in Software, from One Processor to Another in a Different Cluster.
As illustrated in the prior exemplary parallel program, and in the flowchart
The first step is for one or more processor cores 220 or accelerator cores 250 to write the message data payload bytes to the cluster CRAM 230. (Step 502.)
Note that in some embodiments, a parallel application program may take advantage of a plurality of processor cores in a cluster, by having multiple cores run routines that contribute partial data to one or more message buffers in CRAM to be transmitted.
In the above examples, the library function sendMsg( ) is implemented in five lines of RISC-V assembly in file pa.S/lines 28-32. This code takes two 32-bit operands in registers a0 and a1; a0 is source address, in the processor's cluster's CRAM, of the 32 byte message buffer to send, and a1 is the destination address (a Phalanx address) of some router and client core (usually a computing cluster) elsewhere on the NOC, as well as a local resource address relative to that, of where to store the copy of the message when it arrives.
The assembly implementation of sendMsg( ) performs a memory-mapped I/O (MMIO) store to the processor core's cluster's NOC interface 240. This occurs because one of the operands (here a0) is turned into (0x80000000|a0), and this is decoded by the cluster address decoder (not shown in
The NOC interface then formats a NOC message 398 from this data and the message destination address obtained from the MMIO store (originally passed in by software in register a1 in the sendMsg( ) assembly code above). In an embodiment the Phalanx address of this destination is PA=00xyaaaa, i.e. to send the message to the NOC router at coordinates (PA.x,PA.y) and deliver the up-to-32-byte message to the 16-bit local address PA.aaaa in that cluster. Thus the formatted message 398 comprises these fields: msg={v=1, mx=?, my=?, x=PA.x, y=PA.y, data={addr=PA.aaaa, CRAM output regs}}. The multicast flags msg.mx and msg.my are usually 0 because most message sends are point-to-point, but a NOC interface ‘message send’ MMIO store can also be row-multicast (msg.mx=1), column-multicast (msg.my=1), or broadcast (msg.mx=1, msg.my=1) by supplying particular distinguished ‘multicast’ x and y destination coordinates (in some embodiments, PA.x=15 and PA.y=15, respectively). In some embodiments it is possible to multicast to an arbitrary row or an arbitrary column of NOC routers and their client cores. (Step 510.)
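The message-formatting step (Step 510) can be sketched in software as follows. This is a minimal model under stated assumptions: it assumes the embodiment in which coordinate value 15 is the distinguished multicast coordinate, it assumes x and y occupy bits 23:20 and 19:16 of the Phalanx address, and it represents the message as a Python dictionary whose keys mirror the field names above. The function name format_message is hypothetical.

```python
MULTICAST_COORD = 15   # distinguished 'multicast' coordinate in some embodiments
MSG_SIZE = 32          # maximum payload bytes per message


def format_message(dest_pa, payload):
    """Build the fields of a NOC message from a destination PA and CRAM data."""
    if len(payload) > MSG_SIZE:
        raise ValueError("payload exceeds one message")
    x = (dest_pa >> 20) & 0xF
    y = (dest_pa >> 16) & 0xF
    return {
        "v": 1,                              # message is valid
        "mx": int(x == MULTICAST_COORD),     # row-multicast flag
        "my": int(y == MULTICAST_COORD),     # column-multicast flag
        "x": x,
        "y": y,
        "addr": dest_pa & 0xFFFF,            # local address at the destination
        "data": bytes(payload),
    }


point_to_point = format_message(0x00378000, bytes(32))
broadcast = format_message(0x00FF8000, bytes(32))
print(point_to_point["mx"], point_to_point["my"], broadcast["mx"], broadcast["my"])
```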
Having formatted a message 398 the NOC interface offers it to the router on message-output bus 202, and awaits a signal on router control signals 206 indicating the router (and NOC) have accepted the message and it is on its way to delivery somewhere. (Step 512.) At this point, the NOC interface is ready to accept another MMIO store to perform another message send on behalf of the original processor core or some other processor core in the cluster.
After the NOC accepts the message, the NOC is responsible for transporting the message to a router with matching destination coordinates (msg.x, msg.y). Depending upon the design of the NOC interconnection network, this may take 0, 1, or many cycles of operation. At some time later, the message arrives and is output on the destination router (msg.x, msg.y)'s output port and is available on the destination cluster's message-input bus 204. (Step 514.)
The destination cluster's NOC interface 240 decodes the local address component (here, msg.data.addr==PA.aaaa) to determine the local resource, if any, into which to write the 32 byte data payload. PA.aaaa may designate, without limitation, an address in a local CRAM, or one of that cluster's IRAMs, or a register or memory in an accelerator core. If it is a local CRAM address, the 32 byte message data payload is written to the destination cluster's CRAM in one cycle, by means of, in this embodiment, eight 32-bit stores to the eight banks of address interleaved memory ports depicted on the right hand side of CRAM 230. (Step 516.)
This mechanism of preparing message buffers to be sent in CRAM, and then reading, writing, and carrying extremely wide (here, eight machine words, 32 bytes) message payload data atomically, in one cycle each, has several advantages over prior art message-send mechanisms. By staging messages to CRAM, which in some embodiments is uniformly accessible to the processor cores and accelerator cores of a cluster, these agents may cooperatively prepare messages to be sent and to process messages that have been received. Since messages are read from a CRAM in one cycle, and written to a destination in one cycle, messages are sent/received atomically, with no possibility of partial writes, torn writes, or interleaved writes from multiple senders to a common destination message buffer. All the bytes arrive together.
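The one-cycle, eight-bank write of Step 516 can be illustrated with a toy model of an address-interleaved CRAM. The model assumes word-interleaved banking (consecutive 32-bit words fall in consecutive banks), little-endian word order, and a 32 KB CRAM; the bank organization of an actual embodiment may differ, and the function names are hypothetical.

```python
NBANKS = 8                       # NOC-side banks, one 32-bit port each
WORD_BYTES = 4
MSG_SIZE = NBANKS * WORD_BYTES   # 32-byte message payload
CRAM_SIZE = 32 * 1024
ROWS = CRAM_SIZE // MSG_SIZE

banks = [[0] * ROWS for _ in range(NBANKS)]


def write_message(local_addr, payload):
    """Write one 32-byte payload: one word to each bank, all in the same row."""
    if len(payload) != MSG_SIZE or local_addr % MSG_SIZE != 0:
        raise ValueError("payload must be 32 bytes at a 32-byte-aligned address")
    row = (local_addr % CRAM_SIZE) // MSG_SIZE   # addresses wrap modulo CRAM_SIZE
    for bank in range(NBANKS):                   # in hardware: eight parallel stores
        chunk = payload[bank * WORD_BYTES:(bank + 1) * WORD_BYTES]
        banks[bank][row] = int.from_bytes(chunk, "little")


def load_word(local_addr):
    """Processor-side view: read back one aligned 32-bit word."""
    index = (local_addr % CRAM_SIZE) // WORD_BYTES
    return banks[index % NBANKS][index // NBANKS]


payload = bytes(range(32))
write_message(0x8040, payload)
print(hex(load_word(0x8040)), hex(load_word(0x805C)))
```

Because each aligned 32-byte block maps to exactly one row across all eight banks, the eight stores never conflict, which is what allows the whole payload to be written together.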
In some embodiments, the message buffers may be written by a combination of processor cores and accelerator cores, both coupled to ports on the CRAM. In some embodiments, one or more accelerators in a cluster may write data to message buffers in CRAM. In some embodiments, one or more accelerator cores in a cluster may signal the NOC interface to begin to send a message. In some embodiments, one or more accelerator cores may perform memory-mapped I/O causing the NOC interface to begin to send a message.
Metcalfe's Law states that the value of a telecommunications network is proportional to the square of the number of connected users of the system. Similarly the value of a NOC and the FPGA that implements it is a function of the number and diversity of types of NOC client cores. With this principle in mind, the design philosophy and prime aspiration of the NOC disclosed herein is to “efficiently connect everything to everything.”
Without limitation, many types of client cores may be connected to a NOC. Referring to
One key class of external devices to interface to an FPGA NOC is a memory device. In general, a memory device may be volatile, such as static RAM (SRAM) or dynamic RAM (DRAM), including double data rate (DDR) DRAM, graphics double data rate (GDDR), quad data rate (QDR) DRAM, reduced latency DRAM (RLDRAM), Hybrid Memory Cube (HMC), WideIO DRAM, and High Bandwidth Memory (HBM) DRAM. Or a memory may be non-volatile, such as ROM, FLASH, phase-change memory, or 3D XPoint memory. Usually there is one memory channel per device or bank of devices (e.g. a DRAM DIMM memory module), but emerging memory interfaces such as HMC and HBM provide many high-bandwidth channels per device. For example, a single HBM device (die stack) provides eight channels of 128 signals at a signaling rate of 1-2 Gbps/signal.
FPGA vendor libraries and tools provide external-memory-channel-controller interface cores. To interconnect such a client core to a NOC, i.e., to interconnect the client to a router's message input port and a message output port, one can use a bridge circuit to accept memory transaction requests (e.g., load, or store, a block of bytes) from other NOC clients and present them to the DRAM channel controller, and vice versa, to accept responses from the memory channel controller, format them as NOC messages, and send them via the router to other NOC clients.
The exemplary parallel packet-processing system disclosed herein describes a NOC client that may send a DRAM store message to a DRAM controller client core to store one byte or many bytes to a particular address in RAM, or may send a DRAM load request message to cause the DRAM channel client to perform a read transaction on the DRAM, then transmit back over the NOC the resulting data to the target (cluster, processor) identified in the request message.
As another example, the exemplary FPGA SOC described above in conjunction with
An embodiment of the area-efficient NOC disclosed herein makes possible a system that allows any client core at any site in the FPGA, connected to some router, to access any external memory via any memory-channel-controller-client core. To fully utilize the potential bandwidth of an external memory, one may implement a very wide and very fast NOC. For example, a 64-bit DDR4 2400 interface can transmit or receive data at up to 64-bits times 2.4 GHz=approximately 150 Gbps. A Hoplite NOC of channel width w=576 bits (512 bits of data and 64 bits of address and control) running at 333 MHz can carry up to 170 Gbps of data per link. In an FPGA with a pipelined interconnect fabric such as Altera HyperFlex, a 288-bit NOC of 288-bit routers running at 667 MHz also suffices.
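The bandwidth comparison above can be checked numerically. In the sketch below, the split of the 288-bit link into 256 data bits plus 32 bits of address and control is an assumption (half of the 512+64 split given for the 576-bit link); it is not stated in the text.

```python
def gbps(data_bits, clock_hz):
    """Payload bandwidth in Gbps for a link moving data_bits per clock."""
    return data_bits * clock_hz / 1e9


ddr4_2400 = gbps(64, 2.4e9)        # 64-bit interface at 2400 MT/s
hoplite_576 = gbps(512, 333e6)     # 512 data bits of a 576-bit link
hoplite_288 = gbps(256, 667e6)     # assumed 256 data bits of a 288-bit link

print(f"DDR4-2400: {ddr4_2400:.1f} Gbps")
print(f"576-bit NOC at 333 MHz: {hoplite_576:.1f} Gbps")
print(f"288-bit NOC at 667 MHz: {hoplite_288:.1f} Gbps")
```

Both NOC configurations exceed the memory channel's peak rate, so either link can carry the channel's full bandwidth.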
In some embodiments, multiple banks of DRAM devices interconnected to the FPGA by multiple DRAM channels are employed to provide the FPGA SOC with the necessary bandwidth to meet workload-performance requirements. Although it is possible for the multiple external DRAM channels to be aggregated into a single DRAM controller client core, coupled to one router on the NOC, this may not provide the other client cores on the NOC with full-bandwidth access to the multiple DRAM channels. Instead, an embodiment provides each external DRAM channel with its own full-bandwidth DRAM channel-controller client core, each coupled to a separate NOC router, affording highly concurrent and full-bandwidth ingress and egress of DRAM request messages between the DRAM controller client cores and other clients of the NOC.
In some use cases, different memory-request NOC messages may use different minimum-bit-width messages. For example, in the exemplary parallel packet processing FPGA SOC described above in conjunction with
Alternatively, in other embodiments of the system, a system designer may elect to implement an SOC's DRAM memory system by instantiating in the design two parallel NOCs, a 300-bit-wide NOC and a 64-bit-wide NOC, one to carry messages with a 32 byte data payload, and the second to carry messages without such a data payload. Since the area of a Hoplite router is proportional to the bit width of its switch data path, a system with a 300-bit NOC and an additional 64-bit NOC requires less than 25% more FPGA resources than a system with one 300-bit NOC alone.
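The 25% bound follows directly from the stated proportionality between router area and link width, as this short check shows. It treats area as exactly proportional to bit width, which is a simplification; the function name is illustrative.

```python
def extra_area_fraction(primary_bits, secondary_bits):
    """Relative area added by a second NOC, assuming area scales with link width."""
    return secondary_bits / primary_bits


extra = extra_area_fraction(300, 64)
print(f"second NOC adds about {100 * extra:.0f}% to NOC area")
```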
In this dual-NOC example, a client core 210 that issues DRAM-load messages is a client of both NOCs. That is, the client core 210 is coupled to a first, 300-bit-message NOC router and is also coupled to a second, 64-bit-message NOC router. An advantage of this arrangement of clients and routers is that the shorter DRAM-load-request messages may traverse their own NOC, separately, and without contending with, DRAM-store and DRAM-load-response messages that traverse their NOC. As a result, a greater total number of DRAM transaction messages may be in flight across the two NOCs at the same time, and therefore a higher total bandwidth of DRAM traffic may be served for a given area of FPGA resources and for a given expenditure of energy.
In general, the use of multiple NOCs in a system, and the selective coupling of certain client cores to certain routers of multiple NOCs, can be an advantageous arrangement and embodiment of the disclosed routers and NOCs. In contrast, in conventional NOC systems, which are much less efficient, the enormous FPGA resources and energy consumed by each NOC makes it impractical to impossible to instantiate multiple parallel NOCs in a system.
To best interface an FPGA SOC (and its many constituent client cores) to a High Bandwidth Memory (HBM) DRAM device, which provides eight channels of 128-bit data at 1-2 GHz, a system design may use, for example, without limitation, eight HBM channel-controller-interface-client cores, coupled to eight NOC router cores. A NOC with 128-Gbps links suffices to carry full-bandwidth memory traffic to and from HBM channels of 128 bits operating at 1 GHz.
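To size such a NOC, the per-channel HBM bandwidth can be converted into the number of payload bits a link must carry per NOC clock cycle. The sketch below assumes the 3 ns (about 333 MHz) NOC clock of the earlier example, which is not stated for the HBM case; the function names are hypothetical.

```python
import math

HBM_CHANNELS = 8
CHANNEL_BITS = 128


def channel_gbps(signal_rate_gbps):
    """Bandwidth of one 128-bit HBM channel at the given per-signal rate."""
    return CHANNEL_BITS * signal_rate_gbps


def payload_bits_needed(link_gbps, noc_cycle_ns):
    """Gbit/s times ns gives bits per cycle; round up to whole bits."""
    return math.ceil(link_gbps * noc_cycle_ns)


per_channel = channel_gbps(1)                      # 128 Gbps at 1 Gbps/signal
device_total = HBM_CHANNELS * per_channel          # all eight channels
bits_at_3ns = payload_bits_needed(per_channel, 3)  # payload width at ~333 MHz

print(per_channel, "Gbps per channel;", device_total, "Gbps per device")
print(bits_at_3ns, "payload bits per cycle at a 3 ns NOC clock")
```

Under this assumption a 128 Gbps link needs a 384-bit payload per cycle, wider than the 256-bit payload of the earlier 300-bit example, or else a faster NOC clock.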
Another type of die-stacked, high-bandwidth DRAM memory is Hybrid Memory Cube (HMC). Unlike HBM, which employs a very wide parallel interface, HMC uses multiple high-speed serial links, which operate at speeds of 15 Gbps/pin, over fewer pins. An FPGA interface to an HMC device, therefore, uses multiple serdes (serializer/deserializer) blocks to transmit data to and from the HMC device, according to an embodiment. Despite this signaling difference, considerations of how to best couple the many client cores in an FPGA SOC to an HMC device, via a NOC, are quite similar to the embodiment of the HBM system described above. The HMC device is logically accessed as numerous high-speed channels, each typically 64 bits wide. Each such channel might employ an HMC channel-controller-interface client core to couple that channel's data into the NOC to make the remarkable total memory bandwidth of the HMC device accessible to the many client cores arrayed on the NOC.
A second category of external-memory device, nonvolatile memory (NVM), including FLASH and next generation 3D XPoint memory, generally runs memory-channel interfaces at lower bandwidths. This may afford the use of a less-resource-intensive NOC configured with lower-bandwidth links, according to an embodiment. A narrower NOC comprising narrower links and correspondingly smaller routers, e.g., w=64 bits wide, may suffice.
Alternatively, a system may comprise an external NVM memory system comprising a great many NVM devices, e.g., a FLASH memory array, or a 3D XPoint memory array, packaged in a DIMM module and configured to present a DDR4-DRAM-compatible electrical interface. By aggregating multiple NVM devices together, high-bandwidth transfers to the devices may be achieved. In this case, the use of a high bandwidth NVM-channel-controller client core and a relatively higher-bandwidth NOC and NOC routers can provide the NOC's client cores full-bandwidth access to the NVM memory system, according to an embodiment.
In a similar manner, other memory devices and memory systems (i.e., compositions and arrangements of memory devices) may be interfaced to the FPGA NOC and its other clients via one or more external-memory-interface client cores, according to an embodiment.
Another category of important external interfaces for a modern FPGA SOC is a networking interface. Modern FPGAs directly support 10/100/1000 Mbps Ethernet and may be configured to support 10G/25G/40G/100G/400G bps Ethernet, as well as other external-interconnection-network standards and systems including, without limitation, Interlaken, RapidIO, and InfiniBand.
Networking systems are described using OSI reference-model layers, e.g., application/presentation/session/transport/network/data link/physical (PHY) layers. Most systems implement the lower two or three layers of the network stack in hardware. In certain network-interface controllers, accelerators, and packet processors, higher layers of the network stack are also implemented in hardware (including programmable logic hardware). For example, a TCP Offload Engine is a system that offloads processing of the TCP/IP stack to hardware at the network interface controller (NIC), instead of performing the TCP housekeeping of connection establishment, packet acknowledgement, checksumming, and so forth, in software, which can be too slow to keep up with very-high-speed (e.g., 10 Gbps or faster) networks.
Within the data-link layer of an Ethernet/IEEE 802.3 system is a MAC (media-access-control circuit). The MAC is responsible for Ethernet framing and control. It is coupled to a physical interface (PHY) circuit. In some FPGA systems, for some network interfaces, the PHY is implemented in the FPGA itself. In other systems, the FPGA is coupled to a modular transceiver module, such as SFP+ format, which, depending upon the choice of module, transmits and receives data according to some electrical or optical interface standard, such as BASE-R (optical fiber) or BASE-KR (copper backplane).
Network traffic is transmitted in packets. Incoming data arrives at a MAC from its PHY and is framed into packets by the MAC. The MAC presents this framed packet data in a stream, to a user logic core, typically adjacent to the MAC on the programmable logic die.
In a system comprising the disclosed NOC, by use of an external-network-interface-controller (NIC) client core coupled to a NOC router, other NOC client cores located anywhere on the device may transmit (or receive) network packets as one or more messages sent to (or received from) the NIC client core, according to an embodiment.
Ethernet packets come in various sizes; most Ethernet frames are 64-1536 bytes long. Accordingly, to transmit packets over the NOC, it is beneficial to segment a packet into a series of one or more NOC messages. For example, a large 1536-Byte Ethernet frame traversing a 256-bit-wide NOC could require 48 256-bit messages to be conveyed from a NIC client core to another NOC client core or vice versa. Upon receipt of a packet (composed of messages), depending upon the packet-processing function of a client core, the client may buffer the packet in on-chip or external memory for subsequent processing, or it may inspect or transform the packet, and subsequently either discard it or immediately retransmit it (as another stream of messages) to another client core, which may be another NIC client core if the resulting packet should be transmitted externally.
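The segmentation arithmetic above can be illustrated with a minimal Python sketch. The function name segment_packet and the zero-padding of the final message are assumptions made for illustration; an actual NIC client core would also carry framing information (e.g., packet length or a last-message flag) that is not modeled here.

```python
from typing import List

def segment_packet(packet: bytes, noc_width_bits: int = 256) -> List[bytes]:
    """Split a packet into fixed-width NOC message payloads.

    The last payload is zero-padded to the full message width (an assumption).
    """
    if noc_width_bits <= 0 or noc_width_bits % 8 != 0:
        raise ValueError("NOC width must be a positive multiple of 8 bits")
    width_bytes = noc_width_bits // 8
    messages = []
    for offset in range(0, len(packet), width_bytes):
        chunk = packet[offset:offset + width_bytes]
        messages.append(chunk.ljust(width_bytes, b"\x00"))
    return messages

frame = bytes(1536)                 # a 1536-byte frame, as in the example above
messages = segment_packet(frame)    # 256-bit (32-byte) messages
print(len(messages))                # 48
```

A minimum-size 64-byte frame would occupy two 256-bit messages under the same assumptions.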
To implement an embodiment of a Hoplite router NOC for interfacing to NIC client cores that transmit a network packet as a series of NOC messages, a designer can configure the Hoplite NOC routers for in-order delivery. An embodiment of the basic Hoplite router implementation, disclosed previously herein and by reference, does not guarantee that a sequence of messages M1, M2, sent from client core C1 to client core C2, will arrive in the order that the messages were sent. For example, upon sending messages M1 and M2 from client C11 at router (1,1) to client C33 at router (3,3), it may be that when message M1 arrives on the X-message input at intermediate router (3,1) via the X ring [y=1], and attempts to route next to router (3,2) on the Y ring [x=3], at that same moment a higher-priority input on router (3,1)'s YI input is allocated the router's Y output. Message M1, therefore, deflects to router (3,1)'s X output, and traverses the X ring [y=1] to return to router (3,1) and to reattempt egress on the router's Y output port. Meanwhile, the message M2 arrives at router (3,1) and later arrives at router (3,3) and is delivered to the client (3,3), which is coupled to the router (3,3). Message M1 then returns to router (3,1), is output on this router's Y-message output port, and is delivered to the client (3,3) of router (3,3). Therefore, the messages were sent in the order M1 then M2, but were received in the reverse order M2 then M1. For some use cases and workloads, out-of-order delivery of messages is fine. But for the present use case of delivering a network packet as a series of messages, it may be burdensome for clients to cope with out-of-order messages because a client is forced to first “reassemble” the packet before it can start to process the packet.
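To make the reassembly burden concrete, the following Python sketch shows the kind of reorder buffer a client might need if messages can arrive out of order. The per-message sequence number and the class name ReorderBuffer are hypothetical; they are not part of the router or message format described herein.

```python
class ReorderBuffer:
    """Releases messages in sequence order even if they arrive out of order."""

    def __init__(self):
        self.next_seq = 0   # next sequence number that may be released
        self.pending = {}   # seq -> payload, held until its turn comes

    def receive(self, seq, payload):
        """Accept one message; return the payloads that can now be released in order."""
        self.pending[seq] = payload
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released

buf = ReorderBuffer()
first = buf.receive(1, "M2")    # M2 arrives first because M1 was deflected
second = buf.receive(0, "M1")   # M1 arrives after its extra trip around the X ring
print(first, second)            # [] ['M1', 'M2']
```

The buffer storage and bookkeeping shown here are the cost that in-order delivery by the NOC itself avoids.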
Therefore, in an embodiment, a Hoplite router, which has a configurable routing function, may be configured with a routing function that ensures in-order delivery of a series of messages between any specific source router and destination router. In an embodiment, this configuration option may also be combined with the multicast option, to also ensure in-order multicast delivery. In an embodiment, the router is not configurable, but it nevertheless is configured to implement in-order delivery.
Using an embodiment of the in-order message-delivery method, it is straightforward to couple various NIC client cores 140 (
Many different external-network-interface core clients may be coupled to the NOC. A NIC client 140 may comprise a simple PHY, a MAC, or a higher-level network-protocol implementation such as a TCP Offload Engine. In an embodiment, the PHY may be implemented in the FPGA, in an external IC, or may be provided in a transceiver module, which may use electrical or optical signaling. In general, the NOC router and link widths can be configured to support full-bandwidth operation of the NOC for the anticipated workload. For 1 Gbps Ethernet, almost any width and frequency NOC will suffice, whereas for 100 Gbps Ethernet, a 64-Byte packet arrives at a NIC approximately every 6 ns; therefore, to achieve 100 Gbps bandwidth on the NOC, wide, fast routers and links, comparable to those disclosed earlier for carrying high-bandwidth DRAM messages, are needed. For example, a 256-bit-wide NOC operating at 400 MHz, or a 512-bit-wide NOC operating at 200 MHz, is sufficient to carry 100 Gbps Ethernet packets at full bandwidth between client cores.
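The sizing figures above can be checked with a short Python calculation. The function names are illustrative, and the 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, and inter-frame gap) are an assumption added here to account for the interval between back-to-back minimum-size frames.

```python
def noc_bandwidth_gbps(width_bits, clock_mhz):
    """Peak link bandwidth when one message of width_bits moves per cycle."""
    return width_bits * clock_mhz / 1000.0

def min_frame_interval_ns(line_rate_gbps, frame_bytes=64, overhead_bytes=20):
    """Time between back-to-back frames, including assumed per-frame overhead."""
    return (frame_bytes + overhead_bytes) * 8 / line_rate_gbps

print(min_frame_interval_ns(100))     # about 6.7 ns per 64-byte frame
print(noc_bandwidth_gbps(256, 400))   # 102.4 Gbps
print(noc_bandwidth_gbps(512, 200))   # 102.4 Gbps
```

Both example NOC configurations provide slightly more than 100 Gbps of peak link bandwidth, which is consistent with the statement that either is sufficient.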
An embodiment of an FPGA system on a chip comprises a single external network interface, and, hence, a single NIC client core on the NOC. Another embodiment may use multiple interfaces of multiple types. In an embodiment, a single NOC is adequate to interconnect these external-network-interface client cores to the other client cores on the NOC. In an embodiment, NIC client cores 140 may be connected to a dedicated high-bandwidth NOC for ‘data-plane’ packet routing, and to a secondary lower-bandwidth NOC for less-frequent, less-demanding ‘control-plane’ message routing.
Besides the various Ethernet network interfaces, implementations, and data rates described herein, many other networking and network-fabric technologies, such as RapidIO, InfiniBand, Fibre Channel, and Omni-Path fabrics, each benefit from interconnection with other client cores over a NOC, using the respective interface-specific NIC client core 140, and coupling the NIC client core to its NOC router. Once an external-network-interface client core is added to the NOC, it may begin to participate in messaging patterns such as maximum-bandwidth direct transfers from NIC to NIC, or NIC to DRAM, or vice versa, without requiring intervening processing by a (comparatively slow) processor core and without disturbing a processor's memory hierarchy.
In an embodiment, a NOC may also serve as network switch fabric for a set of NIC client cores. In an embodiment, only some of the routers on the NOC have NIC client cores; other routers may have no client inputs or outputs. In an embodiment, these “no-input” routers can use the advantageous lower-cost NOC router-switch circuit and technology-mapping efficiencies described by reference. In an embodiment that implements multicast fanout of switched packets, the underlying NOC routers may also be configured to implement multicast routing, so that as an incoming packet is segmented by its NIC client core into a stream of messages, and these messages are sent into the NOC, the message stream is multicast to all, or to a subset, of the other NIC client cores on the NOC for output upon multiple external-network interfaces.
Another important external interface to couple to the NOC is the PCI Express (PCIe) interface. PCIe is a high-speed, serial, computer-expansion bus that is widely used to interconnect CPUs, storage devices, solid-state disks, FLASH storage arrays, graphics-display devices, accelerated network-interface controllers, and diverse other peripherals and functions.
Modern FPGAs comprise one or more PCIe endpoint blocks. In an embodiment, a PCIe master or slave endpoint is implemented in an FPGA by configuring an FPGA's PCIe endpoint block and configuring programmable logic to implement a PCIe controller. In an embodiment, programmable logic also implements a PCIe DMA controller so that an application in the FPGA may issue PCIe DMA transfers to transfer data from the FPGA to a host or vice-versa.
In an embodiment, an FPGA PCIe controller, or a PCIe DMA controller, may be coupled to a NOC by means of a PCIe interface client core, which comprises a PCIe controller and logic for interfacing to a NOC router. A PCIe interface client core enables advantageous system use cases. In an embodiment, any client core on the NOC may access the PCIe interface client core, via the NOC, by sending NOC messages that encapsulate PCI Express read and write transactions. Therefore, recalling the prior exemplary network-packet-processing system described above in conjunction with
In an embodiment, in addition to facilitating remote single-word read or write transactions, external hosts and on-die client cores may utilize a PCIe DMA (direct memory access) engine capability of a PCIe interface client core to perform block transfers of data from host memory, into the PCIe interface client core, and then via the NOC to a specific client core's local memory. In an embodiment, the reverse is also supported: a block of data is transferred from the memory of a specific client core on the NOC to the PCIe interface client core, and then, as a set of PCIe transaction messages, to a memory region on a host or other PCIe-interconnected device.
Recalling, as described above, that a NOC may also serve as network switch fabric for a set of NIC client cores, in the same manner, in an embodiment, a NOC may also serve as a PCIe switch fabric for a set of PCIe client cores. As external PCIe transaction messages reach a PCIe interface client core, they are encapsulated as NOC messages and sent via the NOC to a second PCIe interface client core, and then are transmitted externally as PCIe transaction messages to a second PCIe attached device. As with the network switch fabric, in an embodiment a PCIe switch fabric may also take advantage of NOC multicast routing to achieve multicast delivery of PCIe transaction messages.
Another important external interface in computing devices is SATA (serial advanced technology attachment), which is the interface by which most storage devices, including hard disks, tapes, optical storage, and solid-state disks (SSDs), interface to computers. Compared to DRAM channels and 100 Gbps Ethernet, the 3/6/16 Gbps signaling rates of modern SATA are easily carried on relatively narrow Hoplite NOC routers and links. In an embodiment, SATA interfaces may be implemented in FPGAs by combining a programmable-logic SATA controller core and an FPGA serdes block. Accordingly, in an embodiment, a SATA interface Hoplite client core comprises the aforementioned SATA controller core, serdes, and a Hoplite router interface. A NOC client core sends storage-transfer-request messages to the SATA interface client core, or in an embodiment, may copy a block of memory to be written or a block of memory to be read, to/from a SATA interface client core as a stream of NOC messages.
Besides connecting client cores to specific external interfaces, a NOC can provide an efficient way for diverse client cores to interconnect to, and exchange data with, a second interconnection network. Here are a few non-limiting examples. In an embodiment, for performance scalability reasons, a very large system may comprise a hierarchical system of interconnects such as a plurality of secondary interconnection networks that themselves comprise, and are interconnected by, a NOC into an integrated system. In an embodiment, the routers of these hierarchical NOCs may be addressed using 3D or higher-dimensional coordinates, e.g., router (x,y,i,j) is the (i,j) router in the secondary NOC found on the global NOC at global NOC router (x,y). In an embodiment, a system may be partitioned into separate interconnection networks for network management or security considerations, and then interconnected, via a NOC, with message filtering between separate networks. In an embodiment, a large system design may not physically fit into a particular FPGA, and, therefore, is partitioned across two or more FPGAs. In this example, each FPGA comprises its own NOC and client cores, and there is a need for some way to bridge sent messages so that clients on one NOC may conveniently communicate with clients on a second NOC. In an embodiment, the two NOCs in two different devices are bridged; in another embodiment, the NOC segments are logically and topologically one NOC, with message rings extending between FPGA devices and messages circulating between FPGAs using parallel, high-speed I/O signaling, now available in modern FPGAs, such as Xilinx RXTXBITSLICE IOBs. In an embodiment, a NOC may provide a high-bandwidth “superhighway” between client cores, and the NOC's client cores themselves may have constituent subcircuits interconnected by other means. A specific example of this is the multiprocessor/accelerator-compute-cluster client core diagrammed in
In each of these examples, clients of these varied interconnect networks may be advantageously interconnected into an integrated whole by means of treating the various subordinate interconnection networks themselves as an aggregated client core of a central Hoplite NOC. As a client core, the subordinate interconnection network comprises a NOC interface by which means it connects to a Hoplite NOC router and sends and receives messages on the NOC. In
Now turning to the matter of interconnecting as many internal (on-chip) resources and cores together as possible via a NOC, one of the most important classes of internal-interface client cores is a “standard-IP-interface” bridge client core. A modern FPGA SOC is typically a composition of many prebuilt and reusable “IP” (intellectual property) cores. For maximal composability and reusability, these cores generally use industry-standard peripheral interconnect interfaces such as AXI4, AXI4 Lite, AXI4-Stream, AMBA AHB, APB, CoreConnect, PLB, Avalon, and Wishbone. In order to connect these preexisting IP cores to one another and to other clients via a NOC, a “standard-IP-interface” bridge client core is used to adapt the signals and protocols of the IP interface to NOC messages and vice versa.
In some cases, a standard-IP-interface bridge client core is a close match to the NOC messaging semantics. An example is AXI4-Stream, a basic unidirectional flow-controlled streaming IP interface with ready/valid handshake signals between the master, which sends the data, and the slave, which receives the data. An AXI4-Stream bridge NOC client may accept AXI4-Stream data as a slave, format the data into a NOC message, and send the NOC message over the NOC to the destination NOC client, where (if the destination client is also an AXI4-Stream IP bridge client core) a NOC client core receives the message and provides the stream of data, acting as an AXI4-Stream master, to its slave client. In an embodiment, the NOC router's routing function is configured to deliver messages in order, as described above. In an embodiment, it may be beneficial to utilize an elastic buffer or FIFO either to buffer incoming AXI4-Stream data before it is accepted as messages on the NOC (which may be necessary if the NOC is heavily loaded), or to buffer data at the NOC message output port until the AXI4-Stream consumer becomes ready to accept the data. In an embodiment, it is beneficial to implement flow control between source and destination clients so that (e.g., when the stream consumer negates its ready signal to hold off stream-data delivery for a relatively long period of time) the message buffer at the destination does not overflow. In an embodiment, flow control is credit based, in which case the source client “knows” how many messages may be received by the destination client before its buffer overflows. Therefore, the source client sends up to that many messages, then awaits credit-return messages from the destination client, which signal that buffered messages have been processed and that more buffer space has been freed.
In an embodiment, this credit return message flows over the first NOC; in another embodiment, a second NOC carries credit-return messages back to the source client. In this case, each AXI4-Stream bridge client core is a client of both NOCs.
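The credit-based flow control described above can be sketched in Python as follows. The class and method names are hypothetical, and the model delivers messages and credits instantly, whereas in a real system both would travel as NOC messages with some latency.

```python
from collections import deque

class BufferedReceiver:
    """Destination client with a bounded message buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.fifo = deque()

    def deliver(self, msg):
        # With correct credit accounting this check never fails.
        assert len(self.fifo) < self.capacity, "destination buffer overflow"
        self.fifo.append(msg)

    def consume(self, sender):
        """The stream consumer accepts one message; one credit returns to the source."""
        msg = self.fifo.popleft()
        sender.on_credit_return()
        return msg

class CreditedSender:
    """Source client that sends only while it holds credits."""

    def __init__(self, credits):
        self.credits = credits   # messages the destination can still buffer

    def try_send(self, receiver, msg):
        if self.credits == 0:
            return False         # must wait for a credit-return message
        self.credits -= 1
        receiver.deliver(msg)
        return True

    def on_credit_return(self, n=1):
        self.credits += n

rx = BufferedReceiver(capacity=2)
tx = CreditedSender(credits=2)
sent = [tx.try_send(rx, m) for m in ("a", "b", "c")]
print(sent)                    # [True, True, False]
rx.consume(tx)                 # consumer drains "a", returning one credit
print(tx.try_send(rx, "c"))    # True
```

The initial credit count equals the destination buffer capacity, so the source can never have more messages outstanding than the destination can hold.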
The other AXI4 interfaces, AXI4 and AXI4-Lite, implement transactions using five logical unidirectional channels that each resemble AXI4-Stream, with ready/valid handshake flow-controlled interfaces. The five channels are Read Address (master to slave), Read Data (slave to master), Write Address (master to slave), Write Data (master to slave), and Write Response (slave to master). An AXI4 master writes to a slave by writing write transactions to the Write Address and Write Data channels and receiving responses on the Write Response channel. A slave receives write-command data on the Write Address and Write Data channels and responds by writing on the Write Response channel. A master performs reads from a slave by writing read-transaction data to the Read Address channel and receiving responses from the Read Data channel. A slave receives read-command data on the Read Address channel and responds by writing data to the Read Data channel.
An AXI4 master or slave bridge converts the AXI4 protocol messages into NOC messages and vice-versa. In an embodiment, each AXI4 datum received on any of its five constituent channels is sent from a master (or slave) as a separate message over the NOC from source router (master (or slave)) to destination router (slave (or master)) where, if there is a corresponding AXI slave/master bridge, the message is delivered on the corresponding AXI4 channel. In another embodiment with higher performance, each AXI4 bridge collects as much AXI4 channel data as it can in a given clock cycle from across all of its AXI4 input channels, and sends this collected data as a single message over the NOC to the destination bridge, which unpacks it into its constituent channels. In another embodiment, a bridge client waits until it receives enough channel data to correspond to one semantic request or response message such as “write request (address, data)” or “write response” or “read request(address)” or “read response(data),” and then sends that message to the destination client. This approach may simplify the interconnection of AXI4 masters or slaves to non-AXI4 client cores elsewhere on the NOC.
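As an illustration of the last embodiment above, the following Python sketch pairs Write Address and Write Data channel items into a single semantic write-request message. The class name, the dictionary message format, and the restriction to single-beat writes (no bursts) are simplifying assumptions for illustration only.

```python
from collections import deque

class WriteRequestAssembler:
    """Pairs Write Address and Write Data items into one write-request message."""

    def __init__(self):
        self.addresses = deque()   # items seen on the Write Address channel
        self.data = deque()        # items seen on the Write Data channel

    def on_write_address(self, addr):
        self.addresses.append(addr)
        return self._emit()

    def on_write_data(self, value):
        self.data.append(value)
        return self._emit()

    def _emit(self):
        """Return a complete message once both halves are present, else None."""
        if self.addresses and self.data:
            return {"kind": "write_request",
                    "addr": self.addresses.popleft(),
                    "data": self.data.popleft()}
        return None

bridge = WriteRequestAssembler()
print(bridge.on_write_address(0x100))   # None: still waiting for the data
print(bridge.on_write_data(7))          # one complete write-request message
```

Because the two channels are independent, either half may arrive first; the sketch handles both orders.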
Thus a NOC-intermediated AXI4 transfer from an AXI4 master to an AXI4 slave actually traverses an AXI4 master to an AXI4 slave bridge-client core to a source router through the NOC to a destination router to an AXI4 master bridge-client core to the AXI4 slave (and vice-versa for response channel messages). As in the above description of AXI4-Stream bridging, in an embodiment it may be beneficial to implement credit-based flow control between client cores.
In a similar way, other IP interfaces described herein, without limitation, may be bridged to couple clients of those IP interfaces to the NOC, and thereby to other NOC clients.
An “AXI4 Interconnect IP” core is a special kind of system core whose purpose is to interconnect the many AXI4 IP cores in a system. In an embodiment, a Hoplite NOC plus a number of AXI4 bridge-client cores may be configured to implement the role of “AXI4 Interconnect IP”, and, as the number of AXI4 clients or the bandwidth requirements of clients scales up well past ten cores, this extremely efficient NOC+bridges implementation can be the highest-performance, and most resource-and-energy-efficient, way to compose the many AXI4 IP cores into an integrated system.
Another important type of internal NOC client is an embedded microprocessor. As described above, particularly in the description of the packet-processing system, an embedded processor may interact with other NOC clients via messages, to perform such functions as: read or write a byte, half word, word, double word, or quad word of memory or I/O data; read or write a block of memory; read or write a cache line; transmit a MESI cache-coherence message such as read, invalidate, or read for ownership; convey an interrupt or interprocessor interrupt; explicitly send or receive messages as explicit software actions; send or receive command or data messages to or from an accelerator core; convey performance trace data; stop, reset, or debug a processor; and perform many other kinds of information transfer amenable to delivery as messages. In an embodiment, an embedded-processor NOC client core may comprise a soft processor. In an embodiment, an embedded-processor NOC client core may comprise a hardened, full-custom “SOC” subsystem such as an ARM processor core in the Xilinx Zynq PS (processing subsystem). In an embodiment, a NOC client core may comprise a plurality of processors. In an embodiment, a NOC may interconnect a processor NOC client core and a second processor NOC client core.
The gradual slowing of conventional microprocessor-performance scaling, and the need to reduce energy per datacenter workload, motivate FPGA acceleration of datacenter workloads. This in turn motivates deployment of FPGA accelerator cards connected to multiprocessor server sockets via PCI Express in datacenter server blades. Over several design generations, FPGAs will be coupled ever closer to processors.
Close integration of FPGAs and server CPUs can include advanced packaging wherein the server CPU die and the FPGA die are packaged side by side via a chip-scale interconnect such as Xilinx 2.5D Stacked Silicon Integration (SSI) or Intel Embedded Multi-Die Interconnect Bridge (EMIB). Here an FPGA NOC client is coupled via the NOC, via an “external coherent interface” bridge NOC client, and via the external coherent interface, to the cache coherent memory system of the server CPU die. The external interconnect may support cache-coherent transfers and local-memory caching across the two dies, employing technologies such as, without limitation, Intel QuickPath Interconnect or IBM/OpenPower Coherent Accelerator Processor Interface (CAPI). This advance will make it more efficient for NOC clients on the FPGA to communicate and interoperate with software threads running on the server processors.
FPGA-server CPU integration can also include embedding an FPGA fabric onto the server CPU die, or equivalently, embedding server CPU cores onto the FPGA die. Here it is imperative to efficiently interconnect FPGA-programmable accelerator cores to server CPU cores and other fixed-function accelerator cores on the die. In this era, the many server CPU cores will be interconnected to one another and to the “uncore” (i.e., the rest of the chip excluding CPU cores and FPGA fabric cores) via an uncore-scalable interconnect fabric such as a 2D torus. The FPGA fabric resources in this SOC may be in one large contiguous region or may be segmented into smaller tiles located at various sites on the die (and logically situated at various sites on the 2D torus). Here an embodiment of the disclosed FPGA NOC will interface to the rest of the SOC using “FPGA-NOC-to-uncore-NOC” bridge FPGA-NOC client cores. In an embodiment, FPGA NOC routers and uncore NOC routers may share the router addressing scheme so that messages from CPUs, fixed logic, or FPGA NOC client cores may simply traverse into the hard uncore NOC or the soft FPGA NOC according to the router address of the destination router. Such a tightly coupled arrangement facilitates efficient, high-performance communication amongst FPGA NOC client cores, uncore NOC client cores, and server CPU cores.
Modern FPGAs comprise hundreds of embedded block RAMs, embedded fixed-point DSP blocks, and embedded floating-point DSP blocks, distributed at various sites all about the device. One FPGA system-design challenge is to efficiently access these resources from many clients at other sites in the FPGA. An FPGA NOC makes this easier.
Block RAMs are embedded static RAM blocks. Examples include 20 Kbit Altera M20Ks, 36 Kbit Xilinx Block RAMs, and 288 Kbit Xilinx Ultra RAMs. As with other memory interface NOC client cores described above, a block RAM NOC client core receives memory-load or store-request messages, performs the requested memory transaction against the block RAM, and (for load requests) sends a load-response message with the loaded data back to the requesting NOC client. In an embodiment, a block RAM controller NOC client core comprises a single block RAM. In an embodiment, a block RAM controller NOC client core comprises an array of block RAMs. In an embodiment, the data bandwidth of an access to a block RAM is not large—up to 10 bits of address and 72 bits of data at 500 MHz. In another embodiment employing block RAM arrays, the data bandwidth of the access can be arbitrarily large. For example, an array of eight 36 Kbit Xilinx block RAMs can read or write 576 bits of data per cycle, i.e., up to 288 Gbps. Therefore, an extremely wide NOC of 576 to 1024 bits may allow full utilization of the bandwidth of one or more of such arrays of eight block RAMs.
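The request/response behavior of a block RAM NOC client core, and the array-bandwidth arithmetic above, can be sketched in Python as follows. The dictionary message format, the field names, and the 512-word by 72-bit organization of a single 36 Kbit block RAM are assumptions made for illustration.

```python
WORD_BITS = 72
DEPTH = 512   # 512 words x 72 bits = 36 Kbit (assumed organization)

class BlockRamClient:
    """Serves load and store request messages against one block RAM."""

    def __init__(self):
        self.mem = [0] * DEPTH

    def handle(self, msg):
        """Process one request; return a load-response message, or None for a store."""
        addr = msg["addr"] % DEPTH
        if msg["op"] == "store":
            self.mem[addr] = msg["data"] & ((1 << WORD_BITS) - 1)
            return None
        if msg["op"] == "load":
            # The response is addressed back to the requesting client.
            return {"op": "load_response", "dest": msg["src"], "data": self.mem[addr]}
        raise ValueError(f"unknown op {msg['op']!r}")

ram = BlockRamClient()
ram.handle({"op": "store", "src": (0, 0), "addr": 5, "data": 42})
resp = ram.handle({"op": "load", "src": (2, 1), "addr": 5})
print(resp)

# Eight such RAMs in parallel at 500 MHz: 8 * 72 bits * 500 MHz
ARRAY_GBPS = 8 * WORD_BITS * 500 / 1000
print(ARRAY_GBPS)   # 288.0
```

The final calculation reproduces the 576 bits per cycle and 288 Gbps figures given above for an array of eight block RAMs.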
Embedded DSP blocks are fixed logic to perform fixed-point wide-word math functions such as add and multiply. Examples include the Xilinx DSP48E2 and the Altera variable-precision DSP block. An FPGA's many DSP blocks may also be accessed over the NOC via a DSP NOC client core. The latter accepts a stream of messages from its NOC router, each message encapsulating an operand or a request to perform one or more DSP computations; and a few cycles later, sends a response message with the results back to the client. In an embodiment, the DSP function is configured as a specific fixed operation. In an embodiment, the DSP function is dynamic and is communicated to the DSP block, along with the function operands, in the NOC message. In an embodiment, a DSP NOC client core may comprise an embedded DSP block. In an embodiment, a DSP NOC client core may comprise a plurality of embedded DSP blocks.
Embedded floating-point DSP blocks are fixed logic to perform floating-point math functions such as add and multiply. One example is the Altera floating-point DSP block. An FPGA's many floating-point DSP blocks and floating-point enhanced DSP blocks may also be accessed over the NOC via a floating-point DSP NOC client core. The latter accepts a stream of messages from its NOC router, each message encapsulating an operand or a request to perform one or more floating-point computations; and a few cycles later, sends a response message with the results back to the client. In an embodiment, the floating-point DSP function is configured as a specific fixed operation. In an embodiment, the floating-point DSP function is dynamic and is communicated to the DSP block, along with the function operands, in the NOC message. In an embodiment, a floating-point DSP NOC client core may comprise an embedded floating-point DSP block. In an embodiment, a floating-point DSP NOC client core may comprise a plurality of embedded floating-point DSP blocks.
A brief example illustrates the utility of coupling the internal FPGA resources, such as block RAMs and floating-point DSP blocks, with a NOC so that they may be easily and dynamically composed into a parallel-computing device. In an embodiment, in an FPGA, each of the hundreds of block RAMs and hundreds of floating-point DSP blocks is coupled to a NOC via a plurality of block RAM NOC client cores and floating-point DSP NOC client cores. Two vectors A[ ] and B[ ] of floating-point operands are loaded into two block RAM NOC client cores. A parallel dot product of the two vectors may be obtained as follows: the contents of the two vectors' block RAMs are streamed into the NOC as messages and both sent to a first floating-point DSP NOC client core, which multiplies them together elementwise; the resulting stream of elementwise products is sent by the first floating-point DSP NOC client core via the NOC to a second floating-point DSP NOC client core, which adds the products together to accumulate a dot product of the two vectors. In another embodiment, two N×N matrices A[,] and B[,] are distributed, row-wise and column-wise, respectively, across many block RAM NOC client cores; and an arrangement of N×N instances of the prior embodiment's dot-product pipeline is configured so as to stream each row of A and each column of B into a dot-product pipeline instance. The results of these dot-product computations are sent as messages via the NOC to a third set of block RAM NOC client cores that accumulate the matrix-multiply-product result C[,]. This embodiment performs a parallel, pipelined, high-performance floating-point matrix multiply. In this embodiment, all of the operands and results are carried between memories and function units over the NOC.
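The data flow of the dot-product and matrix-multiply example can be modeled functionally in Python. This is only a sequential software sketch: Python generators stand in for streams of NOC messages, the function names are hypothetical, and the parallelism and pipelining of the hardware embodiment are not modeled.

```python
def multiplier_core(stream_a, stream_b):
    """First floating-point DSP client: multiplies paired operand messages."""
    for a, b in zip(stream_a, stream_b):
        yield a * b

def accumulator_core(products):
    """Second floating-point DSP client: accumulates the stream of products."""
    total = 0.0
    for p in products:
        total += p
    return total

def dot(a, b):
    """One dot-product pipeline: two operand streams -> multiply -> accumulate."""
    return accumulator_core(multiplier_core(iter(a), iter(b)))

def matmul(A, B):
    """N x N pipeline instances: row i of A and column j of B produce C[i][j]."""
    n = len(A)
    cols = [[B[k][j] for k in range(n)] for j in range(n)]
    return [[dot(A[i], cols[j]) for j in range(n)] for i in range(n)]

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In the hardware embodiment, each call to dot corresponds to one pipeline instance operating concurrently with the others, with every operand and result carried as a NOC message.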
It is particularly advantageous that the data-flow graph of operands and operations and results is not fixed in wires nor in a specific programmable-logic configuration, but rather is dynamically achieved by simply varying the (x,y) destinations of messages between resources sent via the NOC. Therefore, a data-flow-graph fabric of memories and operators may be dynamically adapted to a workload or computation, cycle by cycle, microsecond by microsecond.
Another important FPGA resource is a configuration unit. Some examples include the Xilinx ICAP (Internal Configuration Access Port) and PCAP (Processor Configuration Access Port). A configuration unit enables an FPGA to reprogram, dynamically, a subset of its programmable logic, also known as “partial reconfiguration”, to dynamically configure new hardware functionality into its FPGA fabric. By coupling an ICAP to the NOC by means of a configuration unit NOC client core, the ICAP functionality is made accessible to the other client cores of the NOC. For example, a partial-reconfiguration bitstream, used to configure a region of the programmable logic fabric, may be received from any other NOC client core. In an embodiment, the partial-reconfiguration bitstream is sent via an Ethernet NIC client core. In an embodiment, the partial-reconfiguration bitstream is sent via a DRAM channel NOC client core. In an embodiment, the partial-reconfiguration bitstream is sent from a hardened embedded-microprocessor subsystem via an embedded-processor NOC client core.
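For illustration, the delivery of a partial-reconfiguration bitstream over the NOC may be modeled as splitting the bitstream into fixed-size message payloads and reassembling them at the configuration unit NOC client core. The Python sketch below is hypothetical: the 32-byte payload size, the seq and last message fields, the tolerance for out-of-order arrival, and the names to_messages and ConfigUnitClient are assumptions introduced here, and no actual ICAP or PCAP interface is modeled.

```python
def to_messages(bitstream, dest, payload_bytes=32):
    """Split a bitstream into numbered messages addressed to `dest`."""
    count = (len(bitstream) + payload_bytes - 1) // payload_bytes
    return [
        {
            "dest": dest,
            "seq": i,
            "last": i == count - 1,
            "data": bitstream[i * payload_bytes:(i + 1) * payload_bytes],
        }
        for i in range(count)
    ]


class ConfigUnitClient:
    """Collects bitstream messages and reassembles them by sequence number."""

    def __init__(self):
        self.chunks = {}      # seq -> payload bytes
        self.expected = None  # total message count, known once `last` arrives

    def receive(self, msg):
        self.chunks[msg["seq"]] = msg["data"]
        if msg["last"]:
            self.expected = msg["seq"] + 1

    def complete(self):
        return self.expected is not None and len(self.chunks) == self.expected

    def bitstream(self):
        return b"".join(self.chunks[i] for i in range(self.expected))


original = bytes(range(100))
messages = to_messages(original, dest=(3, 2))

client = ConfigUnitClient()
for msg in reversed(messages):  # deliver out of order on purpose
    client.receive(msg)

print(client.complete())               # True
print(client.bitstream() == original)  # True
```

The sender in this model could equally stand for an Ethernet NIC client core, a DRAM channel NOC client core, or an embedded-processor NOC client core, since the receiver depends only on the message contents.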
In a dynamic-partial-reconfiguration system, the partially reconfigurable logic is generally floor planned into specific regions of the programmable logic fabric. A design challenge is how this logic may be best communicatively coupled to other logic in the system, whether fixed programmable logic or more dynamically reconfigured programmable logic, anticipating that the logic may be replaced by other logic in the same region at a later moment. By coupling the reconfigurable logic cores to other logic by means of a NOC, it becomes straightforward for any reconfigurable logic to communicate with non-reconfigurable logic and vice versa. A partial-reconfig NOC client core comprises a partial-reconfig core designed to directly attach to a NOC router on a fixed set of FPGA nets (wires). A series of different partial-reconfig NOC client cores may be loaded at a particular site in an FPGA. Since each reconfiguration directly couples to the NOC router's message input and output ports, each enjoys full connectivity with other NOC client cores in the system.
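The benefit of a fixed attachment to the NOC router may be sketched, under stated assumptions, as a site whose message interface is unchanged while the core loaded behind it is replaced. In the Python model below, RouterSite, load, and deliver are hypothetical names, and a plain function stands in for each partial-reconfig NOC client core.

```python
class RouterSite:
    """A NOC router port whose attached client core can be swapped at run time."""

    def __init__(self, xy):
        self.xy = xy
        self.core = None  # currently loaded partial-reconfig client core

    def load(self, core):
        # Stand-in for loading a partial-reconfiguration bitstream at this site.
        self.core = core

    def deliver(self, payload):
        # The message interface is the same no matter which core is loaded.
        if self.core is None:
            raise RuntimeError(f"no client core loaded at {self.xy}")
        return self.core(payload)


site = RouterSite((2, 3))

site.load(lambda words: sum(words))   # first core: a summing function unit
first = site.deliver([1, 2, 3])

site.load(lambda words: max(words))   # later core loaded at the same site
second = site.deliver([1, 2, 3])

print(first, second)  # 6 3
```

Other client cores in such a model address the site by its (x,y) coordinates only, so they need not change when the core at that site is replaced.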
In an embodiment, a data parallel compiler and runtime, such as, in some embodiments, an OpenCL compiler and runtime, targets the many soft processors 220 and configured accelerator cores of the parallel computing system. In an embodiment, an OpenCL compiler and runtime implements some OpenCL kernels in software, executed on a plurality of soft processors 220, and some kernels in hardware accelerator cores, connected as client cores on the NOC 150 or as configured accelerator cores 250 in clusters 250 in the system.
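One way such a runtime might choose between software and hardware implementations of a kernel is sketched below in Python. This is an assumption-laden illustration and does not use the OpenCL API: the dispatch function, the accelerators and software tables, the round-robin division of work items, and the default of four soft processors are all hypothetical.

```python
def dispatch(kernel, work_items, accelerators, software, n_soft_processors=4):
    """Run `kernel` on an accelerator core if one is configured, else in software.

    Returns a (target, results) pair, where target is "hardware" or "software".
    """
    if kernel in accelerators:
        # A configured accelerator core handles every work item.
        return "hardware", [accelerators[kernel](x) for x in work_items]

    # Otherwise divide the work items round-robin across the soft processors.
    results = [None] * len(work_items)
    for p in range(n_soft_processors):
        for i in range(p, len(work_items), n_soft_processors):
            results[i] = software[kernel](work_items[i])
    return "software", results


# Every kernel has a software version; only "square" also has an accelerator.
software = {"scale": lambda x: 2 * x, "square": lambda x: x * x}
accelerators = {"square": lambda x: x * x}

where_scale, scaled = dispatch("scale", [1, 2, 3, 4, 5], accelerators, software)
where_square, squared = dispatch("square", [1, 2, 3], accelerators, software)

print(where_scale, scaled)     # software [2, 4, 6, 8, 10]
print(where_square, squared)   # hardware [1, 4, 9]
```

The soft processors are modeled here by an ordinary loop, so the sketch shows only how work might be assigned and not any parallel speedup.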
In an embodiment, accelerator cores 250 may be synthesized by a high level synthesis tool. In an embodiment, NOC client cores may be synthesized by a high level synthesis tool.
In an embodiment, a system floor-planning EDA tool incorporates configuration and floor planning of a parallel computing system and NOC topologies, and may be used to place and interconnect client core blocks to routers of the NOC.
Some applications of an embodiment include, without limitation, 1) reusable modular “IP” NOCs, routers, and switch fabrics, with various interfaces including AXI4; 2) interconnecting FPGA subsystem client cores to interface controller client cores, for various devices, systems, and interfaces, including DRAMs and DRAM DIMMs, in-package 3D die stacked or 2.5D stacked silicon interposer interconnected HBM/WideIO2/HMC DRAMs, SRAMs, FLASH memory, PCI Express, 1G/10G/25G/40G/100G/400G networks, FibreChannel, SATA, and other FPGAs; 3) as a component in parallel-processor overlay networks; 4) as a component in OpenCL host or memory interconnects; 5) as a component as configured by a SOC builder design tool or IP core integration electronic design automation tool; 6) use by FPGA electronic design automation CAD tools, particularly floor-planning tools and programmable-logic placement and routing tools, to employ a NOC backbone to mitigate the need for physical adjacency in placement of subsystems, or to enable a modular FPGA implementation flow with separate, possibly parallel, compilation of a client core that connects to the rest of the system through a NOC client interface; 7) use in dynamic-partial-reconfiguration systems to provide high-bandwidth interconnectivity between dynamic-partial-reconfiguration blocks, and via floor planning to provide guaranteed logic- and interconnect-free “keep-out zones” for facilitating loading new dynamic-logic regions into the keep-out zones; and 8) use of the disclosed parallel computer, router and NOC system as a component or plurality of components, in computing, datacenters, datacenter application accelerators, high-performance computing systems, machine learning, data management, data compression, deduplication, databases, database accelerators, networking, network switching and routing, network processing, network security, storage systems, telecom, wireless telecom and base stations, video production and routing, embedded systems, embedded vision systems, consumer electronics, entertainment systems, automotive systems, autonomous vehicles, avionics, radar, reflection seismology, medical diagnostic imaging, robotics, complex SOCs, hardware emulation systems, and high frequency trading systems.
The various embodiments described above can be combined to provide further embodiments. These and other changes can be made to the embodiments in light of the above-detailed description. In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/274,745 filed on Jan. 4, 2016, entitled “MASSIVELY PARALLEL COMPUTER AND DIRECTIONAL TWO-DIMENSIONAL ROUTER AND INTERCONNECTION NETWORK FOR FIELD PROGRAMMABLE GATE ARRAYS AND OTHER CIRCUITS AND APPLICATIONS OF THE COMPUTER, ROUTER, AND NETWORK”, and claims the benefit of U.S. Provisional Patent Application Ser. No. 62/307,330 filed on Mar. 11, 2016, entitled “MASSIVELY PARALLEL COMPUTER AND DIRECTIONAL TWO-DIMENSIONAL ROUTER AND INTERCONNECTION NETWORK FOR FIELD PROGRAMMABLE GATE ARRAYS AND OTHER CIRCUITS AND APPLICATIONS OF THE COMPUTER, ROUTER, AND NETWORK”, both of which are hereby incorporated herein by reference. This application is related to U.S. patent application Ser. No. 14/986,532, entitled “DIRECTIONAL TWO-DIMENSIONAL ROUTER AND INTERCONNECTION NETWORK FOR FIELD PROGRAMMABLE GATE ARRAYS, AND OTHER CIRCUITS AND APPLICATIONS OF THE ROUTER AND NETWORK,” which was filed 31 Dec. 2015 and which claims priority to U.S. Patent App. Ser. No. 62/165,774, which was filed 22 May 2015. These related applications are incorporated by reference herein. This application is related to PCT/US2016/033618, entitled “DIRECTIONAL TWO-DIMENSIONAL ROUTER AND INTERCONNECTION NETWORK FOR FIELD PROGRAMMABLE GATE ARRAYS, AND OTHER CIRCUITS AND APPLICATIONS OF THE ROUTER AND NETWORK,” which was filed 20 May 2016, and which claims priority to U.S. Patent App. Ser. No. 62/165,774, which was filed on 22 May 2015, U.S. patent application Ser. No. 14/986,532, which was filed on 31 Dec. 2015, U.S. Patent App. Ser. No. 62/274,745, which was filed 4 Jan. 2016, and U.S. Patent Application Ser. No. 62/307,330, which was filed 11 Mar. 2016. These related applications are incorporated by reference herein.
Number | Date | Country
---|---|---
62274745 | Jan 2016 | US
62307330 | Mar 2016 | US