Prefix scan is a basic primitive widely used in parallel computing applications such as sorting, string comparison, array packing, solving linear systems, and load balancing. A low-latency, high-throughput prefix scan implementation is therefore important for scaling the performance of such applications.
Typical implementations of prefix scan utilize multiple rounds of software-controlled computation and communication between the nodes. A pipelined software algorithm uses two passes over a binary tree, where each node executes one step of both passes in every round. However, this approach incurs high latency owing to the overhead of transferring data from the network into software memory. Further, compute resources on the nodes are reserved for calculating aggregations in the prefix scan and for coordinating inter-node messages. This can reduce the efficiency and scalability of the underlying application, especially for system software or applications that use prefix scan frequently, such as radix sort.
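By way of background and for illustration only, the following non-limiting sketch (in Python, with illustrative inputs) shows the classic two-pass up-sweep/down-sweep formulation of an exclusive prefix sum over an implicit binary tree on which such pipelined software baselines are built; it depicts the computation being offloaded, not the in-network method disclosed herein.

# Software baseline sketch: exclusive prefix sum via two passes over an
# implicit binary tree (up-sweep then down-sweep). The input length is
# assumed to be a power of two; op and identity are illustrative defaults.
def exclusive_scan_two_pass(values, op=lambda a, b: a + b, identity=0):
    x = list(values)
    n = len(x)

    # Up-sweep: each internal tree node aggregates its two children.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            x[i] = op(x[i - step], x[i])
        step *= 2

    # Down-sweep: push partial aggregates back toward the leaves.
    x[n - 1] = identity
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            left = x[i - step]
            x[i - step] = x[i]
            x[i] = op(left, x[i])
        step //= 2
    return x

assert exclusive_scan_two_pass([3, 1, 4, 1, 5, 9, 2, 6]) == [0, 3, 4, 8, 9, 14, 23, 25]

In a distributed implementation, every round of each pass corresponds to software-managed message exchanges between nodes, which is the source of the latency and compute overheads noted above.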
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
Embodiments of methods and apparatus for in-network parallel prefix scan are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
For clarity, individual components in the Figures herein may also be referred to by their labels in the Figures, rather than by a particular reference number. Additionally, reference numbers referring to a particular type of component (as opposed to a particular component) may be shown with a reference number followed by “(typ)” meaning “typical.” It will be understood that the configuration of these components will be typical of similar components that may exist but are not shown in the drawing Figures for simplicity and clarity or otherwise similar components that are not labeled with separate reference numbers. Conversely, “(typ)” is not to be construed as meaning the component, element, etc. is typically used for its disclosed function, implementation, purpose, etc.
In accordance with aspects of the embodiments disclosed herein, a dual binary tree topology to compute prefix scans in-network is provided that leverages switches to perform aggregation operations on data contained in data packets as they move through the network. By embedding the topology on a network, prefix computation can be completely offloaded to the network with a single transfer of values from process memory to the network and vice-versa (to obtain the results of the prefix scan calculation). This drastically reduces synchronization, accelerates prefix scan calculation, and enables computation-communication overlap.
The proposed topology may be embedded in a physical network, and can be scaled to large multi-dimensional networks such as PIUMA (Programmable Integrated Unified Memory Architecture). The embodiments exhibit a latency logarithmic in the number of participating processes and can compute multiple element-wise prefix scans in a pipelined manner when each die/process contributes a vector of elements. This disclosure also describes the performance bottlenecks of the topology and an embedding recommendation to generate a high throughput prefix scan pipeline.
The basic principle is to use two binary trees in a feed-forward topology to implement an exclusive prefix scan computation pipeline in the network. The formulation of the exclusive prefix scan is given as follows:
y0 = I⊕
y1 = x0
y2 = x0 ⊕ x1
. . .
yi = x0 ⊕ x1 ⊕ . . . ⊕ xi−1
where yi and xi are the output and input of the ith process, respectively, ⊕ is the operation to be performed (e.g., sum, multiplication, max, etc.), and I⊕ is the identity value of ⊕ (e.g., 0 for sum, 1 for multiplication, etc.). Note that the exclusive scan is more generic because it can easily be converted to an inclusive scan (by computing yi⊕xi at the ith process), whereas the reverse may not be possible for some operators ⊕, such as max.
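For illustration only, the following non-limiting sketch (in Python, with illustrative values) restates the formulation above and the conversion from exclusive to inclusive scan by applying yi⊕xi at each process:

# Exclusive prefix scan per the formulation above, and conversion to an
# inclusive scan by computing y_i ⊕ x_i at the i-th process.
def exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    ys, acc = [], identity
    for x in xs:
        ys.append(acc)            # y_i = x_0 ⊕ x_1 ⊕ ... ⊕ x_{i-1}; y_0 = I⊕
        acc = op(acc, x)
    return ys

def inclusive_from_exclusive(xs, ys, op=lambda a, b: a + b):
    return [op(y, x) for x, y in zip(xs, ys)]

xs = [3, 1, 7, 0, 4]
ys = exclusive_scan(xs)                          # [0, 3, 4, 11, 11]
assert inclusive_from_exclusive(xs, ys) == [3, 4, 11, 11, 15]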
The processes insert input values xi on the leaves of one of the trees (called the up tree) and receive the outputs yi on the leaves of the other tree (called the down tree). Nodes at intermediate levels acquire their values by aggregating data in in-flight packets only, and do not require any initialization or interaction with the process memory.
In further detail, two binary aggregation trees in a feed-forward topology are implemented such that leaves of one tree represent the input array values inserted in the network and leaves of the second tree represent the output prefix scan values. The output is computed on-the-fly as the data is routed through the tree edges, with calculations being made using compute engines in the switches. Thus, prefix scan computation can be completely offloaded to the network by mapping (i) leaves of both the trees to the process memory and network interface and (ii) aggregator nodes in the trees to network switches.
Input is injected on leaf nodes 102 of the up tree when a process (one of P0, P1, . . . P7) calls the instruction corresponding to prefix scan in the instruction set architecture (ISA) of a core (or other type of compute unit). Values received on leaves 104 of the down tree are deposited back into the process memory. The proposed topology also allows pipelined computation of multiple prefix scans over an array of values per process. When an element-wise prefix scan on an array is computed, the calling process specifies the location of the array in local memory along with the number of elements in the array. The collective engine inserts these values into the network one by one. It also counts the number of values output to the calling process and indicates completion when that count equals the number of input values ingested into the network.
Note that the rightmost process P7 inserts 0 into the tree even though its input value is 9. This is because, per the formulation of the exclusive scan, the value of the last process (P7 in this example) is not included in the output at any other process. Moreover, the output of the exclusive prefix scan at the first process (P0 in this example) is the identity element under ⊕ (which is 0 for addition). Therefore, the aggregators on the rightmost arm of the up tree pass the 0 value unchanged to the root vertex, which further passes it to the down tree. This also eliminates the need to initialize the root of the down tree with the identity element. Existing software implementations use the root process to perform such initialization, which is not feasible in an in-network computing scenario.
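For illustration only, the following non-limiting sketch (in Python) models the dataflow of the dual-tree topology in software: the up tree aggregates subtree values, feed-forward edges carry each left-subtree value into the down tree, and the down-tree leaves receive the exclusive prefixes. The input vector is hypothetical except that the last process's value (9) echoes the example above; the identity is injected in its place so that no separate initialization of the down-tree root is required.

# Functional model of the dual binary tree exclusive scan (not the hardware).
def dual_tree_exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    P = len(xs)                        # number of processes, assumed a power of two
    injected = xs[:-1] + [identity]    # last process injects the identity, not x_{P-1}
    up_left = {}                       # feed-forward values: left-subtree aggregate per node

    # Up tree: compute the aggregate of each subtree [lo, hi) and record the
    # left child's aggregate, which is forwarded to the matching down-tree node.
    def up(lo, hi):
        if hi - lo == 1:
            return injected[lo]
        mid = (lo + hi) // 2
        left, right = up(lo, mid), up(mid, hi)
        up_left[(lo, hi)] = left
        return op(left, right)

    up(0, P)   # in hardware, the rightmost arm carries the identity to the root

    # Down tree: each node receives the aggregate of everything left of its
    # subtree; the left child gets it unchanged, the right child gets it ⊕
    # the left-subtree aggregate received from the up tree.
    ys = [None] * P
    def down(lo, hi, incoming):
        if hi - lo == 1:
            ys[lo] = incoming
            return
        mid = (lo + hi) // 2
        down(lo, mid, incoming)
        down(mid, hi, op(incoming, up_left[(lo, hi)]))

    down(0, P, identity)               # the identity arrives via the up-tree root
    return ys

xs = [3, 1, 4, 1, 5, 9, 2, 9]          # P7's input (9) never reaches another process
assert dual_tree_exclusive_scan(xs) == [0, 3, 4, 8, 9, 14, 23, 25]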
In some embodiments, distributed compute components are used to perform graph processes, such as illustrated in
At a physical component level, the smallest unit in the PIUMA architecture is a PIUMA die, which is integrated as a System on a Chip (SoC), also referred to as a PIUMA chip or PIUMA socket. As explained and illustrated below, a PIUMA die/socket includes multiple core tiles and switch tiles. In the illustrated embodiment, a PIUMA core tile 302 has two types of cores, including multi-thread cores (MTCs) 304 and single-thread cores (STCs) 306.
MTCs 304 comprise round-robin multi-threaded in-order pipelines. At any moment, a thread can only have one in-flight instruction, which considerably simplifies the core design for better energy efficiency. STCs 306 are used for single-thread performance-sensitive tasks, such as memory and thread management threads (e.g., from the operating system). These are in-order stall-on-use cores that are able to exploit some instruction- and memory-level parallelism, while avoiding the high power consumption of aggressive out-of-order pipelines. In one embodiment, both core types implement the same custom RISC instruction set.
Each MTC and STC has a small data and instruction cache (D$ and I$), and a register file (RF) sized to support its thread count. For multi-thread core 304 this includes a data cache (D$) 308, an instruction cache (I$) 310, and a register file 312. For single-thread core 306 this includes a D$ 314, an I$ 316, and a register file 318. A multi-thread core 304 also includes a core offload engine 320, while a single-thread core 306 includes a core offload engine 322.
Because of the low locality in graph workloads, no higher cache levels are included, avoiding the wasted chip area and power consumption of large caches. In one embodiment, for scalability, caches are not coherent across the whole system. It is the responsibility of the programmer to avoid modifying shared data that is cached, or to flush caches if required for correctness. MTCs 304 and STCs 306 are grouped into Cores 324 (also called blocks), each of which has a large local scratchpad (SPAD) 326 for low-latency storage, a block offload engine 328, and local memory (e.g., some form of Dynamic Random Access Memory (DRAM) 330). Programmers are responsible for selecting which memory accesses to cache (e.g., local stack), which to put on the SPAD (e.g., often-reused data structures or the result of a DMA gather operation), and which not to store locally. There are no prefetchers, to avoid useless data fetches and to limit power consumption. Instead, block offload engines 328 can be used to efficiently fetch large chunks of useful data.
Although the MTCs hide some of the memory latency by supporting multiple concurrent threads, their in-order design limits the number of outstanding memory accesses to one per thread. To increase memory-level parallelism and to free more compute cycles to the cores, a memory offload engine (block offload engine 328) is added to a Core 324. The block offload engine performs memory operations typically found in many graph applications in the background, while the cores continue with their computations. The direct memory access (DMA) engine in block offload engine 328 performs operations such as (strided) copy, scatter and gather. Queue engines are responsible for maintaining queues allocated in shared memory, alleviating the core from atomic inserts and removals. They can be used for work stealing algorithms and dynamically partitioning the workload. Collective engines implement efficient system-wide reductions and barriers. Remote atomics perform atomic operations at the memory controller where the data is located, instead of burdening the pipeline with first locking the data, moving the data to the core, updating it, writing back and unlocking. They enable efficient and scalable synchronization, which is indispensable for the high thread count in PIUMA.
The engines are directed by the PIUMA cores using specific PIUMA instructions. These instructions are non-blocking, enabling the cores to perform other work while the operation is done in the background. Custom polling and waiting instructions are used to synchronize the threads with the offloaded operations.
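For illustration only, the following non-limiting sketch (in Python, with hypothetical names that do not correspond to actual PIUMA instructions) captures the non-blocking offload pattern described above: a core issues an engine operation, continues useful work, and later polls or waits for completion.

# Hypothetical model of the issue / keep-computing / poll pattern.
import threading

class OffloadEngineModel:
    """Software stand-in for a block offload engine (the real engine is hardware)."""
    def __init__(self):
        self._done = threading.Event()
        self._result = None

    def issue_gather(self, indices, src):
        def run():
            self._result = [src[i] for i in indices]   # background DMA-style gather
            self._done.set()
        threading.Thread(target=run).start()

    def poll(self):
        return self._done.is_set()                     # non-blocking completion check

    def wait(self):
        self._done.wait()                              # blocking wait, then fetch result
        return self._result

engine = OffloadEngineModel()
engine.issue_gather([4, 0, 2], src=[10, 11, 12, 13, 14])
other_work = sum(range(1000))          # the core keeps computing while the gather runs
assert engine.wait() == [14, 10, 12]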
A core 402 is connected to a respective memory controller (MC) 410, which in turn is connected to process memory comprising DRAM 412. As illustrated for socket 400-0, each of the lower pair of cores in a core tile or the lower pair of switches in a switch tile is connected to a pair of network controllers (NC) 414, while each of the upper pair of cores in a core tile or the upper pair of switches in a switch tile is connected to a pair of inter-die network interfaces (INDI) 416.
A pair of bidirectional links 418 connect each switch 404 of a tile T to the corresponding core or switch (as applicable) in the tile to the left or right of T. A switch in a switch tile 408 is interconnected with the other switches in the switch tile via bidirectional links 420.
PIUMA switches are configured to perform in-flight packet reduction (reduction on both packets and data contained in the packets) and include configurable routing capabilities that allow collective topologies to be embedded into the network. Their flow control mechanism further enables pipelined computation over numerous single element packets for high throughput vector collectives.
Collective packets in a PIUMA network are routed on an exclusive virtual channel. The scheduling mechanism in PIUMA switches prioritizes packets on a collective virtual channel. Hence, performance of in-network collectives is unaffected by the rest of the network traffic.
An input port of the switch has a FIFO buffer associated with the collective virtual channel for transient storage of the data packets. For an in-network prefix scan, these buffers constitute the network memory available for storage of partial sums.
A PIUMA switch has configuration registers that specify the connectivity between input-output (IO) ports for the collective virtual channel. As a given port is connected to a fixed neighboring switch, configuration registers effectively provide a low-level control over the routing paths in a network embedding.
Additionally, a switch includes a Collective Engine (CENG) that can reduce in-flight packets on multiple input ports. Configuration registers of the switch also specify the input ports participating in reduction by the CENG, and the output port where the reduction result is forwarded. Embedding prefix scan into a PIUMA network can therefore be reduced to the problem of setting the switch configurations such that routing and reduction patterns in the network emulate a logical topology of the prefix scan. The CENG can also perform the applicable ⊕ operations (e.g., sum, multiplication, max, etc.) used for calculating the prefix scans in-network, wherein the calculations are completely offloaded from the cores or other types of compute units coupled to the network.
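For illustration only, the following non-limiting sketch (in Python) shows the kind of information a per-switch collective configuration captures; the field names and port numbers are assumptions for exposition and do not describe the actual register layout.

from dataclasses import dataclass, field

@dataclass
class CollectiveSwitchConfig:
    reduce_in_ports: tuple            # input ports whose packets the CENG reduces
    reduce_out_port: int              # port to which the reduction result is forwarded
    op: str = "sum"                   # the ⊕ operator: "sum", "mul", "max", ...
    passthrough: dict = field(default_factory=dict)   # in_port -> out_port forwarding

# Example: an up-tree aggregator that reduces ports 0 and 1, forwards the sum
# on port 5, and forwards the left child's value unchanged on port 6 toward
# the corresponding down-tree aggregator.
cfg = CollectiveSwitchConfig(
    reduce_in_ports=(0, 1),
    reduce_out_port=5,
    op="sum",
    passthrough={0: 6},
)

Embedding the prefix scan then amounts to writing one such configuration per participating switch so that the union of reduction and forwarding paths reproduces the dual-tree topology.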
The terms dies and sockets are generally used interchangeably herein. A PIUMA subnode or node may comprise multiple integrated circuit dies that are arranged on a substrate and interconnected via “wiring” formed in the substrate. A PIUMA socket may generally comprise an integrated circuit (IC) chip that is a separate component (or otherwise a separate “package”). For a PIUMA subnode or node comprised of PIUMA sockets, the sockets may be mounted to a printed circuit board (PCB) or the like, or may be configured using various other types of packaging such as using a multi-chip module or a multi-package module.
The proposed topology is suitable for in-network computation due to uniform resource distribution. In one embodiment, one aggregator from the up and down trees is associated with a respective process (e.g., aggregators in highlighted region of
The proposed topology is also highly scalable. Note that no vertices are associated with P0 in
PIUMA implements a distributed global address space with a HyperX topology connecting the nodes, and an on-chip mesh for connectivity within a node as shown in
The ports on these switches provide connectivity at different levels of the network hierarchy. In one embodiment, sockets within a node and peer nodes in any dimension of the HyperX are all-to-all connected. These dense connectivity patterns substantially simplify embedding of prefix scan. The hierarchical design also allows low-latency optical interconnections for long distance links between sockets and nodes.
For functional correctness, an embedding should guarantee deadlock free operation. Embedding the proposed topology on a physical system employs a simple deadlock avoidance mechanism. Deadlocks occur when there are cycles in the dependency graph of aggregators. Dependencies can be fundamental to the logical topology or can arise as a characteristic of the mapping. Given a vertex v and its parent vp in the dual-binary tree topology, the following fundamental dependencies can be seen in
In a compute capable network, a switch is used for both data aggregation and forwarding. Typically, the input packet on a switch is consumed if all output ports for that packet (including the aggregator if used) are ready to forward or operate upon the input data. This can create additional embedding induced dependencies between two aggregators.
The flow control rules for multicasting can induce dependencies that, when combined with the fundamental data dependencies in the topology, may cause deadlocks. For example, consider the embedding shown in
The output packets from left child Ulc are multicast to both Ui and Di. At Di, they must wait for the corresponding partial sum from Dp. During the wait period, they are stored in the limited-capacity buffers on the embedded path between Ulc and Di. Buffers on short paths (small aggregate capacity) and at lower levels of the tree (where the wait period is on the order of the collective latency) can fill up and stall packet insertion in the network pipeline.
The embedding of the proposed dual-tree topology may employ the following rule to avoid deadlocks: for a vertex v, if the edge from left child to the up tree aggregator is embedded in a path p on the network, the down tree aggregator should not be mapped to a switch S that lies on the path p. This guarantees no cycles in the dependency graph and hence, avoids deadlocks.
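For illustration only, the following non-limiting sketch (in Python, with an assumed representation of the embedding) checks the rule above: for every vertex v, the switch hosting v's down-tree aggregator must not appear on the network path embedding the edge from v's left child to v's up-tree aggregator.

# Deadlock-rule check over a hypothetical embedding description.
def deadlock_rule_violations(embedding):
    """embedding: vertex -> {"down_switch": id, "left_child_to_up_path": [switch ids]}"""
    violations = []
    for v, info in embedding.items():
        if info["down_switch"] in info["left_child_to_up_path"]:
            violations.append(v)
    return violations

embedding = {
    "v0": {"down_switch": 7, "left_child_to_up_path": [2, 3, 4]},   # satisfies the rule
    "v1": {"down_switch": 3, "left_child_to_up_path": [2, 3, 4]},   # violates the rule
}
assert deadlock_rule_violations(embedding) == ["v1"]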
If the maximum dilation of an edge in the embedding is constant, the worst-case latency of the proposed topology is logarithmic in the number of processes (or elements in prefix scan). However, when operating in a pipelined manner, another performance metric to optimize is the throughput achieved by the topology. This disclosure describes the performance bottlenecks in the proposed topology and recommends simple embedding mechanisms that can alleviate these bottlenecks.
When the prefix scan is working in a pipelined manner, multiple inputs are queued for processing. As shown in
Typical software-based approaches deal with such issues by storing this waiting input value in process memory. However, in an in-network computation scenario this may not be feasible, and the value is instead stored in-flight using link buffers. Specifically, the buffers used are those of the links on which the edge from the left child input to the down tree aggregator is embedded. When multiple input values are queued, the (limited capacity) link buffers can fill up and stall the pipeline, as shown in the example embedding of
When embedding the proposed topology, the dilation of these select edges that carry partial sums from the up tree to the down tree (Up tree to Down tree edges in
As an example, when embedding a vertex on a PIUMA die, the unused links can be included in the mapping to increase the dilation of this edge. As shown in
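For illustration only, the following non-limiting sketch (in Python, with illustrative numbers rather than PIUMA parameters) captures the throughput argument: the values waiting at a down-tree aggregator occupy the buffers of the links that embed the up-tree-to-down-tree edge, so the number of queued elements that edge can absorb before stalling grows with its dilation.

def in_flight_capacity(dilation_in_links, buffer_depth_per_link):
    # Each link on the embedded edge contributes one buffer of transient storage.
    return dilation_in_links * buffer_depth_per_link

short_path = in_flight_capacity(dilation_in_links=1, buffer_depth_per_link=4)
longer_path = in_flight_capacity(dilation_in_links=3, buffer_depth_per_link=4)
assert (short_path, longer_path) == (4, 12)   # routing through unused links adds capacity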
The components in
Generally, a reduction operator for a prefix scan can either be pre-programmed using its configuration register, or opcodes may be included in messages sent to the reduction operator to instruct the reduction operator to perform a corresponding reduction operation. For example, a multi-bit opcode may be provided in a message that is parsed by a switch and based on the multi-bit opcode the reduction operator in the switch determines what prefix scan operation to perform.
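For illustration only, the following non-limiting sketch (in Python, with assumed opcode values and no relation to an actual packet format) shows opcode-driven selection of the reduction operation by a switch:

# Hypothetical opcode-to-operator dispatch inside a reduction operator.
REDUCE_OPS = {
    0b00: lambda a, b: a + b,     # sum
    0b01: lambda a, b: a * b,     # multiplication
    0b10: max,                    # max
    0b11: min,                    # min
}

def reduce_payloads(opcode, payload_a, payload_b):
    return REDUCE_OPS[opcode](payload_a, payload_b)

assert reduce_payloads(0b00, 5, 7) == 12
assert reduce_payloads(0b10, 5, 7) == 7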
In some embodiments, a switch or switch tile may include an Infrastructure Processing Unit (IPU) or a Data Processing Unit (DPU). Switches may also comprise hardware programmable switches using languages such as but not limited to P4 (Programming Protocol-independent Packet Processors) and NPL (Network Programming Language).
Generally, various types of point-to-point interconnects may be used for intra-socket/intra-die, inter-socket/inter-die, and interconnects between subnodes or nodes employing links and associated protocols including but not limited to: Peripheral Component Interconnect express (PCIe), Intel® QuickPath Interconnect (QPI), Intel® Ultra Path Interconnect (UPI), Intel® On-Chip System Fabric (IOSF), Omnipath, Compute Express Link (CXL), HyperTransport, high-speed fabric, NVLink, Advanced Microcontroller Bus Architecture (AMBA) interconnect, OpenCAPI, Gen-Z, Cache Coherent Interconnect for Accelerators (CCIX), Ethernet (IEEE 802.3), remote direct memory access (RDMA), InfiniBand, Internet Wide Area RDMA Protocol (iWARP), Quick UDP Internet Connections (QUIC), and RDMA over Converged Ethernet (RoCE).
Generally, the switches used to interconnect nodes and subnodes may include Top of Rack (ToR) switches, leaf switches, spine (backbone) switches, and other types of switches that are deployed in data centers and HPC environments. These switches may employ one or more of the links and protocols above.
Although some embodiments have been described in reference to particular implementations, other implementations are possible according to some embodiments. Additionally, the arrangement and/or order of elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
In the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. Additionally, “communicatively coupled” means that two or more elements that may or may not be in direct contact with each other, are enabled to communicate with each other. For example, if component A is connected to component B, which in turn is connected to component C, component A may be communicatively coupled to component C using component B as an intermediary component.
An embodiment is an implementation or example of the inventions. Reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the inventions. The various appearances “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
As discussed above, various aspects of the embodiments herein may be facilitated by corresponding software and/or firmware components and applications, such as software and/or firmware executed by an embedded processor or the like. Thus, embodiments of this invention may be used as or to support a software program, software modules, firmware, and/or distributed software executed upon some form of processor, processing core, or embedded logic, or a virtual machine running on a processor or core, or otherwise implemented or realized upon or within a non-transitory computer-readable or machine-readable storage medium. A non-transitory computer-readable or machine-readable storage medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a non-transitory computer-readable or machine-readable storage medium includes any mechanism that provides (e.g., stores and/or transmits) information in a form accessible by a computer or computing machine (e.g., computing device, electronic system, etc.), such as recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). The content may be directly executable (“object” or “executable” form), source code, or difference code (“delta” or “patch” code). A non-transitory computer-readable or machine-readable storage medium may also include a storage or database from which content can be downloaded. The non-transitory computer-readable or machine-readable storage medium may also include a device or product having content stored thereon at a time of sale or delivery. Thus, delivering a device with stored content, or offering content for download over a communication medium, may be understood as providing an article of manufacture comprising a non-transitory computer-readable or machine-readable storage medium with such content described herein.
Various components referred to above as processes, servers, or tools described herein may be a means for performing the functions described. The operations and functions performed by various components described herein may be implemented by software running on a processing element, via embedded hardware or the like, or any combination of hardware and software. Such components may be implemented as software modules, hardware modules, special-purpose hardware (e.g., application specific hardware, ASICs, DSPs, etc.), embedded controllers, hardwired circuitry, hardware logic, etc. Software content (e.g., data, instructions, configuration information, etc.) may be provided via an article of manufacture including non-transitory computer-readable or machine-readable storage medium, which provides content that represents instructions that can be executed. The content may result in a computer performing various functions/operations described herein.
As used herein, a list of items joined by the term “at least one of” can mean any combination of the listed terms. For example, the phrase “at least one of A, B or C” can mean A; B; C; A and B; A and C; B and C; or A, B and C.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
This invention was made with Government support under Agreement No. HR0011-17-3-0004, awarded by DARPA. The Government has certain rights in the invention.