Computerized systems typically rely on network connections to transfer data, whether from one computer system to another computer system, one computer component to another computer component, or from one processor to another processor in the same computer. Most computer networks link multiple computerized elements to one another, and include various functions such as verification that a message or other data sent over the network arrived at the intended recipient, confirmation of the integrity of the data, and a method of routing a message to the intended recipient on the network.
These and other basic network functions are used to ensure that a message or data sent via a computerized network reaches the intended recipient intact. When networks are congested, messages may not be forwarded through the network efficiently and may not reach the intended destination in a timely manner or in the order sent. Various problems such as broken routing links, deadlocks, livelocks, and message prioritization can result in some messages being delayed, rerouted, or, in extreme cases, failing to arrive at the intended destination altogether.
Similarly, when networks become noisy, or when a network connection is faulty, network messages can be lost and not reach the intended destination, and transfers of large blocks of data may become delayed. This is commonly due to physical factors like electrical noise, poor connections, broken or damaged wires, impedance mismatches between network components, and other such factors.
For these and other reasons, many computerized networks implement various forms of flow control, such as requiring acknowledgment that a first packet or message in a sequence of packets or messages has been received by the intended recipient before sending the second packet or message. Sometimes, packet transmissions are prioritized so that more urgent data is transmitted first when the network becomes congested or faulty.
It is desired to provide fast, reliable, and efficient messaging between elements in a computerized network.
This document discusses, among other things, apparatuses, systems, and methods for moving data within a computerized system. A system example includes a plurality of processing nodes, a physical channel configured to transfer data between a memory local to a processing node and a network target remote from the processing node, and a block transfer engine configured to allocate a plurality of virtual channels to the physical channel and to transfer a plurality of address-overlapping blocks of data simultaneously using the virtual channels.
A method example includes providing a physical channel to transfer data between a memory local to a processing node and a target remote from the processing node, allocating a plurality of virtual channels to the physical channel, and asynchronously and simultaneously transferring a plurality of address-overlapping blocks of data to the target using the virtual channels.
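Purely as an illustrative sketch, and not as a limitation of the system or method examples above, the following C fragment shows how such transfers might be expressed in software. The types and functions (phys_channel_t, virt_channel_t, vc_alloc, vc_async_put, bte_wait_all) are hypothetical names introduced here for illustration; they declare an interface sketch rather than a complete implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handles, named here for illustration only. */
typedef struct phys_channel phys_channel_t;   /* one physical channel, e.g., channel 120 */
typedef struct virt_channel virt_channel_t;   /* a virtual channel timesharing that physical channel */

/* Hypothetical interface: allocate a virtual channel on a physical channel,
 * start an asynchronous block transfer on it, and wait for completion. */
virt_channel_t *vc_alloc(phys_channel_t *pc);
int  vc_async_put(virt_channel_t *vc, const void *local_src,
                  uint64_t remote_addr, size_t len);
void bte_wait_all(phys_channel_t *pc);

/* Two blocks whose remote address ranges overlap are transferred
 * simultaneously, each over its own virtual channel. */
void transfer_overlapping(phys_channel_t *pc,
                          const uint8_t *local_src, uint64_t remote_base)
{
    virt_channel_t *vc0 = vc_alloc(pc);
    virt_channel_t *vc1 = vc_alloc(pc);

    vc_async_put(vc0, local_src,        remote_base,        4096);  /* remote bytes 0..4095 */
    vc_async_put(vc1, local_src + 2048, remote_base + 2048, 4096);  /* remote bytes 2048..6143, overlapping */

    bte_wait_all(pc);   /* both transfers proceed asynchronously until this point */
}
```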
This overview is intended to provide a general summary of the subject matter of the present patent application. It is not intended to provide an exclusive or exhaustive explanation of the invention. The detailed description is included to provide further information about the subject matter of the present patent application.
In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and specific embodiments in which the invention may be practiced are shown by way of illustration. It is to be understood that other embodiments may be used and structural or logical changes may be made without departing from the scope of the present invention.
The physical channel 120 is part of the interconnection network of the multiprocessor system. In some embodiments, the interconnection network includes a hypercube topology. In some embodiments, the interconnection network includes a Clos topology. In some embodiments, the interconnection network includes a folded-Clos topology. In some embodiments, the interconnection network includes a butterfly topology.
The computerized system 100 includes a Block Transfer Engine (BTE) 125. The BTE 125 supports asynchronous block transfers over the physical channel between a local memory 115A and the remote network target. The BTE is programmed by a local processor to move data asynchronously between local and remote memory. Because of overhead in using the BTE 125, the BTE 125 may be more useful for large, asynchronous data block transfers between processing nodes 105A-105D. The asynchronous block transfers include privileged memory-to-memory copies of data between processing nodes 105A-105D, such as a Remote Direct Memory Access (RDMA) put/get style of transfers. The asynchronous transfers also include privileged messages between processing nodes 105A-105D, such as send/receive style inter-process communication mechanisms.
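For illustration only, the two styles of asynchronous transfer described above can be contrasted with the hypothetical interface sketched below. The function names bte_put, bte_get, and bte_send are not part of any embodiment; they merely distinguish an RDMA-style copy, which names a remote memory address, from a send/receive-style message, which names only a destination node.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t node_id_t;   /* identifies a processing node, e.g., one of 105A-105D */

/* RDMA put/get style: a memory-to-memory copy in which the caller names both
 * a local address and an address in the remote node's memory. */
int bte_put(node_id_t dst_node, uint64_t remote_dst,
            const void *local_src, size_t len);
int bte_get(node_id_t src_node, uint64_t remote_src,
            void *local_dst, size_t len);

/* Send/receive style: the sender names only the destination node; the
 * receiving side decides where the data lands by pre-posting buffers. */
int bte_send(node_id_t dst_node, const void *local_src, size_t len);
```

In either style, the transfer is programmed by the local processor and then proceeds asynchronously, which is why the programming overhead of the BTE 125 is most easily amortized over large blocks.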
The BTE 125 allocates a plurality of virtual channels to the physical channel 120. A virtual channel is a communication channel that timeshares the physical channel 120 with other virtual channels. Each virtual channel includes its own buffers to avoid transfer deadlock. This allows the BTE 125 to transfer a plurality of address-overlapping blocks of data simultaneously (e.g., in parallel) using the virtual channels, while reducing the occurrences of channel lockout.
If a virtual channel 205A has to wait for access to the physical channel, the virtual channel 205A may receive another request to transfer data. The virtual channel 205A includes at least one virtual channel buffer 215 to store data associated with a request for access to the virtual channel 205A when the virtual channel 205A receives simultaneous requests for such access.
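One hypothetical way to picture this timesharing and buffering is the small software model below. The per-channel FIFO stands in for the virtual channel buffer 215; the round-robin grant policy, the channel count, and the buffer depth are assumptions made only for illustration, not the hardware of the embodiments.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_VC   4     /* virtual channels 205A-205D */
#define VC_DEPTH 8     /* entries in each per-channel buffer (cf. buffer 215) */

struct request { int id; };                 /* a pending transfer request */

struct virt_channel {
    struct request buf[VC_DEPTH];           /* per-channel buffer; waiting here does not stall the others */
    int head, count;
};

/* Queue a request on one virtual channel while the physical channel is busy. */
static bool vc_post(struct virt_channel *vc, struct request r)
{
    if (vc->count == VC_DEPTH) return false;          /* channel buffer full */
    vc->buf[(vc->head + vc->count++) % VC_DEPTH] = r;
    return true;
}

/* One arbitration step: round-robin over the virtual channels and let the
 * winner place one request on the shared physical channel. */
static void arbitrate(struct virt_channel vcs[NUM_VC], int *rr)
{
    for (int i = 0; i < NUM_VC; i++) {
        struct virt_channel *vc = &vcs[(*rr + i) % NUM_VC];
        if (vc->count > 0) {
            struct request r = vc->buf[vc->head];
            vc->head = (vc->head + 1) % VC_DEPTH;
            vc->count--;
            *rr = (*rr + i + 1) % NUM_VC;             /* advance the round-robin pointer */
            printf("physical channel carries request %d\n", r.id);
            return;
        }
    }
}

int main(void)
{
    struct virt_channel vcs[NUM_VC] = {0};
    int rr = 0;
    vc_post(&vcs[0], (struct request){ .id = 1 });    /* two channels post simultaneously */
    vc_post(&vcs[2], (struct request){ .id = 2 });
    arbitrate(vcs, &rr);                              /* -> request 1 */
    arbitrate(vcs, &rr);                              /* -> request 2 */
    return 0;
}
```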
The BTE 200 includes a block transfer controller (BTC) 220. In some embodiments, the BTC 220 is a state machine that governs remote memory transfers. The BTE 200 also includes a packet generator 225 to create packets for transmission to a remote target. A message sent by the BTE 200 may include a set of request packets that include one or more of a destination node, an address, a command, a tag, and a source node. If the message is a PUT message, the message includes packets that contain data. Each virtual channel 205A-205D within the BTE 200 may be assigned a unique identifier (ID). A message may include the virtual channel ID and an address within the virtual channel buffer 215.
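The packet format itself is not fixed above; the following struct is only a hypothetical arrangement of the listed fields (destination node, source node, address, command, tag) together with a virtual channel ID and an offset into the virtual channel buffer 215. The field widths and command encodings are assumptions.

```c
#include <stdint.h>

/* Hypothetical command encodings; the description names PUT-style messages
 * that carry data but does not fix numeric values. */
enum bte_cmd { BTE_CMD_SEND = 0, BTE_CMD_PUT = 1, BTE_CMD_GET = 2 };

/* One request packet of a message; field widths are illustrative only. */
struct bte_request_packet {
    uint16_t dst_node;       /* destination processing node                   */
    uint16_t src_node;       /* source processing node                        */
    uint64_t address;        /* target (or source) memory address             */
    uint8_t  command;        /* one of enum bte_cmd                           */
    uint16_t tag;            /* associates request packets with a message     */
    uint8_t  vc_id;          /* unique ID of the issuing virtual channel      */
    uint16_t vc_buf_offset;  /* address within the virtual channel buffer 215 */
    /* A PUT message is followed by packets that contain the data itself. */
};
```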
In some embodiments, the BTE 200 allocates at least one of a BTC 220 or a packet generator 225 to each virtual channel 205A-205D. If each virtual channel 205A-205D is allocated a block transfer controller 220 and a packet generator 225, the BTE 200 may complete the block transfers in a sequence different from the sequence in which the block transfers were initiated. In some embodiments, the BTE 200 allocates a BTC 220 or a packet generator 225 to more than one virtual channel 205A-205D. Thus, there may be more virtual channels than there are BTCs 220 or packet generators 225.
According to some embodiments, the BTE 200 includes one or more channel descriptor tables 230. In some embodiments, each virtual channel 205A-205D includes a channel descriptor table 230. In some embodiments, a channel descriptor table 230 is partitioned among more than one virtual channel 205A-205D.
In some embodiments, the channel descriptor table 230 includes transmit (TX) and receive (RX) channel descriptors. These may be organized into a TX descriptor table and an RX descriptor table within the channel descriptor table 230. The TX and RX channel descriptors are entries in the channel descriptor table 230 that are used to describe virtual channel transfers. For example, if the network target of a transfer includes a memory remote from a processing node, the BTE 200 asynchronously transfers respective blocks of data over respective virtual channels 205A-205D between the processing node and the remote memory according to TX and RX channel descriptors in respective channel descriptor tables 230. Use of the virtual channels 205A-205D allows address ranges of the blocks of data transferred according to the descriptor table 230 to overlap in the remote memory.
The TX and RX channel descriptors may be used to configure a virtual channel 205A. For example, the TX and RX channel descriptors may be used to reset a virtual channel 205A, such as by initializing descriptor indices. The channel descriptors may also be used to enable data length checking on incoming messages to ensure that the data length does not exceed the size of a receive buffer, specify a maximum time for processing a message, and/or enable aggregation of message interrupts. In some embodiments, when aggregating interrupts, pending interrupt requests are accumulated during a specified time period and delivered as a single interrupt.
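The descriptor encodings are likewise not fixed; the structs below are a hypothetical sketch of a channel descriptor table 230 that holds TX and RX descriptor entries and carries the configuration controls just described (data length checking, a maximum processing time, and interrupt aggregation). All field names, widths, and queue depths are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_RING_ENTRIES 64
#define RX_RING_ENTRIES 64

/* Hypothetical TX descriptor: one outgoing message on a virtual channel. */
struct tx_descriptor {
    uint8_t  xfer_type;      /* e.g., SEND, PUT, or GET                      */
    uint8_t  routing;        /* e.g., adaptive vs. deterministic routing     */
    uint16_t dst_node;       /* remote network target                        */
    uint64_t local_addr;     /* source data in local memory                  */
    uint64_t remote_addr;    /* used by PUT/GET; ignored for SEND            */
    uint32_t length;         /* bytes to transfer                            */
};

/* Hypothetical RX descriptor: one posted receive buffer at the target. */
struct rx_descriptor {
    uint64_t buffer_addr;    /* address of the pre-allocated receive buffer  */
    uint32_t buffer_len;     /* capacity, used for data length checking      */
    uint32_t recv_len;       /* length actually received                     */
};

/* Hypothetical per-virtual-channel descriptor table 230, including the
 * configuration controls described above. */
struct channel_descriptor_table {
    /* configuration */
    bool     length_check_enable;  /* reject data longer than the receive buffer */
    uint32_t max_processing_time;  /* maximum time to process a message          */
    bool     irq_aggregate_enable; /* accumulate interrupts over irq_period      */
    uint32_t irq_period;           /* aggregation window                         */

    /* transmit and receive descriptor tables (circular queues) */
    struct tx_descriptor tx[TX_RING_ENTRIES];
    uint32_t tx_head, tx_tail;     /* BTE consumes at head; a process appends at tail */

    struct rx_descriptor rx[RX_RING_ENTRIES];
    uint32_t rx_head, rx_tail;     /* resetting a channel initializes these indices   */
};
```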
If each virtual channel 205A-205D includes a channel descriptor table 230, a virtual channel 205A is configured with the channel descriptor table 230. If the channel descriptor table 230 is partitioned among the virtual channels 205A-205D, a respective virtual channel 205A may be configured with a respective channel descriptor table partition.
The BTE 200 includes a TX queue (not shown) for each virtual channel 205A-205D. In some embodiments, the TX queue is implemented as a circular buffer. A TX descriptor configures the TX message, and the BTE 200 consumes a TX descriptor when processing a TX message. TX descriptors are consumed by the BTE 200 at the beginning or front of the TX queue. An application or process running on a processing node formulates a TX descriptor and adds it to the end of the queue. Thus, the channel descriptor table 230 may be accessed by the BTE 200 or by a process. In some examples, the TX descriptor may specify the type of transfer (e.g., SEND, PUT, or GET), and specify a type of routing (e.g., adaptive routing) for the message.
As with the TX queue, the BTE 200 includes a RX queue (not shown) for each virtual channel 205A-205D. The RX queue is used for posting (e.g., reserving or allocating) buffers to receive incoming data on remote target nodes in the computer system. An RX descriptor may specify the length of data in the message and/or may specify an address in the receiving buffer.
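The queue behavior described in the two preceding paragraphs can be modeled with the standalone ring-buffer sketch below: a process appends descriptors at the tail, and the BTE consumes them from the head. The index arithmetic is a common ring-buffer idiom assumed here for illustration rather than a mechanism recited above; the same structure can hold either TX descriptors or posted RX buffers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RING_ENTRIES 8U   /* illustrative queue depth (power of two) */

struct descriptor { uint32_t payload; };   /* stands in for a TX or RX descriptor */

struct ring {
    struct descriptor entries[RING_ENTRIES];
    uint32_t head;   /* next entry the BTE will consume         */
    uint32_t tail;   /* next free slot the process will fill in */
};

/* A process running on the node formulates a descriptor and adds it to
 * the end (tail) of the queue. */
static bool ring_post(struct ring *r, struct descriptor d)
{
    if (r->tail - r->head == RING_ENTRIES) return false;   /* queue full */
    r->entries[r->tail % RING_ENTRIES] = d;
    r->tail++;
    return true;
}

/* The BTE consumes descriptors from the beginning (head) of the queue. */
static bool ring_consume(struct ring *r, struct descriptor *out)
{
    if (r->head == r->tail) return false;                  /* queue empty */
    *out = r->entries[r->head % RING_ENTRIES];
    r->head++;
    return true;
}

int main(void)
{
    struct ring tx = {0};
    ring_post(&tx, (struct descriptor){ .payload = 1 });   /* e.g., a SEND descriptor */
    ring_post(&tx, (struct descriptor){ .payload = 2 });   /* e.g., a PUT descriptor  */

    struct descriptor d;
    while (ring_consume(&tx, &d))
        printf("BTE processes descriptor %u\n", d.payload);
    return 0;
}
```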
In some embodiments, the computerized system 100 of FIG. 1 uses the BTE 125 to provide asynchronous inter-kernel messaging between processes running on different processing nodes, such as a first process 130A on processing node 105A and a second process 130C on processing node 105C.
The sender process or first process 130A may specify an address of the source data in local memory 115A and a target network endpoint (e.g., local memory 115C), but may not specify a target address at the network endpoint. The BTE 125 transfers the data associated with the message to the target network endpoint using a virtual channel.
The receiving or second process 130C pre-allocates one or more buffers to receive the data associated with the message. The virtual channel of the BTE 125 used in the transfer places the data in the pre-allocated buffers according to the RX descriptor for the message. If no buffer has been allocated when the data arrives at the network target, the virtual channel drops the data; the data is not written and is lost.
In some embodiments, asynchronously transferring data for inter-kernel messaging includes placing data associated with a kernel message into a pre-allocated buffer at the target. The buffer may be pre-allocated by posting it in a receive queue of a descriptor table used to describe transfers over the virtual channels. A descriptor entry in the receive queue may indicate a network endpoint as the target of the message instead of a target address. Data arriving at the network target may be dropped if no buffer is posted when the data arrives.
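As a purely illustrative model of this posting rule, the sketch below keeps a count of pre-posted receive buffers at the target, places an arriving message into the next posted buffer, and drops a message that arrives when no buffer is posted (or that would overflow the posted buffer, per the optional length checking described earlier). The function names, queue sizes, and buffer sizes are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define RX_POSTS   4
#define BUF_BYTES 64

/* Receive queue at the network target: each entry is a pre-allocated buffer
 * posted by the receiving process before data arrives. */
struct rx_queue {
    char buffers[RX_POSTS][BUF_BYTES];
    int  posted;     /* number of buffers currently posted */
    int  next;       /* next posted buffer to fill         */
};

static bool rx_post_buffer(struct rx_queue *q)
{
    if (q->posted == RX_POSTS) return false;
    q->posted++;
    return true;
}

/* Data arriving on the virtual channel is written into the next posted
 * buffer, or dropped (not written, and lost) if nothing is posted or the
 * data would not fit in the posted buffer. */
static bool rx_deliver(struct rx_queue *q, const char *data, int len)
{
    if (q->posted == 0 || len > BUF_BYTES) {
        printf("message dropped\n");
        return false;
    }
    memcpy(q->buffers[q->next], data, (size_t)len);
    q->next = (q->next + 1) % RX_POSTS;
    q->posted--;
    printf("message placed in pre-allocated buffer\n");
    return true;
}

int main(void)
{
    struct rx_queue q = {0};
    rx_deliver(&q, "early", 6);   /* nothing posted yet: dropped        */
    rx_post_buffer(&q);           /* receiver pre-allocates one buffer  */
    rx_deliver(&q, "hello", 6);   /* lands in the posted buffer         */
    return 0;
}
```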
The accompanying drawings that form a part hereof, show by way of illustration, and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
Such embodiments of the inventive subject matter may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Thus, although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations, variations, or combinations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.
The Abstract of the Disclosure is provided to comply with 37 C.F.R. §1.72(b), requiring an abstract that will allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own.