A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This invention relates to transmission of digital information over a communications network. More particularly, this invention relates to efficient memory operations on data carried over a communications network.
The meanings of certain acronyms and abbreviations used herein are given in Table 1.
An I/O device, such as a NIC, may receive incoming data to be scattered in small chunks to different locations in the local host memory, or may be requested to gather and transmit small chunks of data from different locations in the local host memory.
The I/O device typically accesses the local host memory by initiating transactions on an uplink channel to the host, such as a PCIe bus. In systems that are known in the art, the NIC initiates a separate PCIe transaction for each chunk of data that it must scatter or gather in host memory. Each offload transaction requires its own uplink header. As a result, the transactions consume substantial bandwidth and incur high latency.
Attempts have been made in the art to save memory bandwidth in transactions of the sort described above. For example, U.S. Patent Application Publication No. 2014/0040542 proposes a scatter-gather technique that optimizes streaming memory accesses, dealing with irregular memory access patterns using a scatter-gather engine that is cooperative with a memory cache.
According to disclosed embodiments of the invention, the NIC does not initiate a separate offload transaction for each chunk of data, but rather sends a single descriptor over the uplink, specifying multiple addresses in the host memory from which data are to be gathered or scattered. A module on the other end of the uplink, typically associated with or a part of the host memory controller, executes the uplink descriptor and thus carries out all of the relatively small gather or scatter operations on behalf of the NIC. Only a single offload transaction is required, regardless of the number of separate chunks of data that need to be gathered or scattered.
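The saving can be illustrated with simple arithmetic. The following is a minimal sketch under assumed figures: the 16-byte header, the 32-byte descriptor, and the function names are hypothetical and are not taken from the disclosure. Link-layer framing and response-packet headers are ignored.

```python
def uplink_bytes_separate(num_chunks: int, chunk_bytes: int, header_bytes: int) -> int:
    """One transaction per chunk: every chunk carries its own uplink header."""
    return num_chunks * (header_bytes + chunk_bytes)


def uplink_bytes_single(num_chunks: int, chunk_bytes: int, header_bytes: int,
                        descriptor_bytes: int) -> int:
    """One transaction for all chunks: a single header plus a single descriptor."""
    return header_bytes + descriptor_bytes + num_chunks * chunk_bytes


# Assumed figures, for illustration only: 64 chunks of 8 bytes each.
separate = uplink_bytes_separate(num_chunks=64, chunk_bytes=8, header_bytes=16)
single = uplink_bytes_single(num_chunks=64, chunk_bytes=8, header_bytes=16,
                             descriptor_bytes=32)
print(separate, single)  # 1536 560
```

Under these assumptions the per-chunk headers account for two thirds of the bytes in the conventional case, which is the overhead that the single descriptor removes.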
There is provided according to embodiments of the invention a method of communication, which is carried out in a network interface device connected to a host computer by an uplink. The host computer includes a memory, a memory controller, and a scatter-gather offload engine linked to the memory controller. The network interface device is configured for preparing a descriptor including a plurality of specified memory locations, incorporating the descriptor in exactly one upload packet, transmitting the upload packet from the network interface device to the scatter-gather offload engine via the uplink, invoking the scatter-gather offload engine to perform memory access operations cooperatively with the memory controller at the specified memory locations of the descriptor, and to return results of the memory access operations to the network interface device.
According to one aspect of the method, the memory access operations comprise a gather operation, and the results comprise stored data that is read from the specified memory locations.
According to a further aspect of the method, the memory access operations comprise a scatter operation, and the descriptor also includes data to be stored at respective ones of the specified memory locations.
According to yet another aspect of the method, returning results includes transmitting the results from the host computer to the network interface device in a response packet via the uplink.
According to still another aspect of the method, preparing a descriptor includes generating respective pointers to the specified memory locations.
According to an additional aspect of the method, preparing a descriptor includes specifying a stride that defines an offset between successive ones of the specified memory locations.
According to another aspect of the method, preparing a descriptor also includes incorporating a base address of the memory therein, and specifying respective sizes of data segments at the specified memory locations.
According to one aspect of the method, preparing a descriptor also includes specifying a total size of the data segments at the specified memory locations and specifying a number of times to repeat a pattern defined by the stride and an offset between successive instances of the pattern.
There is further provided according to embodiments of the invention an apparatus for communication, including a network interface device connected to a data network, and a host computer connected to the network interface device by an uplink. The host computer comprises a memory, a memory controller, and a scatter-gather offload engine linked to the memory controller. The network interface device is operative for preparing a descriptor including a plurality of specified memory locations, incorporating the descriptor in exactly one upload packet, transmitting the upload packet from the network interface device to the scatter-gather offload engine via the uplink, invoking the scatter-gather offload engine to perform memory access operations cooperatively with the memory controller at the specified memory locations of the descriptor, and accepting results of the memory access operations via the uplink.
For a better understanding of the present invention, reference is made to the detailed description of the invention, by way of example, which is to be read in conjunction with the following drawings, wherein like elements are given like reference numerals, and wherein:
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the various principles of the present invention. It will be apparent to one skilled in the art, however, that not all these details are necessarily always needed for practicing the present invention. In this instance, well-known circuits, control logic, and the details of computer program instructions for conventional algorithms and processes have not been shown in detail in order not to obscure the general concepts unnecessarily.
Documents incorporated by reference herein are to be considered an integral part of the application except that, to the extent that any terms are defined in these incorporated documents in a manner that conflicts with definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
Host complex 22 and memory 24 may be of standard design. Memory 24 may comprise DRAM, but may be any suitable memory. The term host complex refers to a central processing unit (CPU 30) of computer 20 along with associated components. Such components typically include a memory controller 32 and an expansion bus controller 36, which may be integrated with the CPU 30 on a single integrated circuit chip or provided on one or more separate chips. CPU 30 communicates with memory 24 via memory controller 32, which typically has a DDR interface for this purpose. Alternatively, memory 24 and memory controller 32 may have memory interfaces of other sorts, which may be in accordance with any other applicable standard. Typically, CPU 30 also comprises at least one cache 42, which holds copies of data from memory 24 for low-latency access by the CPU, as is known in the art. Linked to memory controller 32 is scatter-gather offload engine 34, which facilitates certain memory operations. The scatter-gather offload engine 34 is typically implemented as a dedicated hardware logic unit, which may be a part of the memory controller 32. Alternatively the scatter-gather offload engine 34 may be incorporated in the CPU 30 or realized as a separate hardware component. In any case, the scatter-gather offload engine 34 receives offload transactions from the NIC 26 and executes the transactions by scattering or gathering data to or from the memory 24 as specified. The functions of the scatter-gather offload engine 34 are described below in detail.
Expansion bus controller 36, such as a PCI bus controller, communicates via an expansion bus, such as a PCI Express bus 38, with input/output (I/O) and other peripheral devices 40 in computer 20.
In systems that are known in the art, memory controller 32 of host complex 22 is connected directly to system memory 24 via interfaces (not shown). Memory access by NIC 26 in connection with packet transmission and reception over network 28 likewise may take place via a bus interface (not shown) and connection 44, such as a PCI Express bus, to a memory control module 46. The control module 46 comprises the memory controller 32 and scatter-gather offload engine 34. A memory access request originating in NIC 26 may be directed either to memory controller 32 or, in certain cases described below, to the scatter-gather offload engine 34. In the latter case, the scatter-gather offload engine 34 controls and coordinates activities of the memory controller 32 according to descriptors received from the NIC 26. The interface between the scatter-gather offload engine 34 and the memory controller 32 uses read/write commands compatible with the memory controller's specification.
In embodiments of the invention, the scatter-gather offload engine 34 can be implemented inside the memory controller, in the CPU, or in any other location within the host complex 22.
In initial step 48, the NIC 26 prepares a single offload transaction, e.g., a PCIe transaction containing a descriptor comprising multiple specified memory locations at which to gather or scatter data. The descriptor may comprise a list of pointers to the memory 24. Alternatively, the descriptor may comprise a stride between consecutive elements of a source array. For example, a stride could be 1 MB between memory accesses of 8 bytes. Further alternatively, the descriptor can comprise a pattern specifying the layout of the memory locations that are to be gathered from or scattered to in the single offload transaction.
An exemplary shared memory API is presented in Listing 1.
void shmem_iput(TYPE *dest, const TYPE *source, ptrdiff_t dst, ptrdiff_t sst, size_t nelems, int pe);
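The strided copy requested by a call of this form can be modeled as follows. This is a minimal local sketch in Python, assuming that dst and sst are strides counted in elements; the function name iput is hypothetical, and the target processing element (pe) is omitted because the model runs in a single address space.

```python
def iput(dest: list, source: list, dst: int, sst: int, nelems: int) -> None:
    """Copy nelems elements, read from source with stride sst, into dest with stride dst."""
    for i in range(nelems):
        dest[i * dst] = source[i * sst]


source = [10, 11, 12, 13, 14, 15]
dest = [0] * 9
# Take every second source element and place it at every third destination slot.
iput(dest, source, dst=3, sst=2, nelems=3)
print(dest)  # [10, 0, 0, 12, 0, 0, 14, 0, 0]
```

Each of the nelems destination writes lands at a different, non-contiguous address, which is the access pattern that a stride descriptor is meant to express in one transaction.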
To initiate the offload transaction, in step 50 the NIC 26 generates an uplink-destined packet (also referred to as an “uplink packet”) containing the descriptor described in initial step 48. For a scatter operation, the uplink packet includes the data that are to be scattered or references to the data. For a gather operation, the descriptor in the uplink packet defines the data that are to be fetched. The scatter-gather offload engine 34 fetches only the required data from the host memory.
On the other end of the uplink connection 44, in step 52 the scatter-gather offload engine 34 receives the uplink packet and executes the actions specified in the uplink descriptor to carry out all of the gather or scatter operations on behalf of the NIC cooperatively with the memory controller 32.
Step 52 comprises steps 54, 56 in which scatter and gather operations are carried out respectively on behalf of the NIC 26. Gathering data from memory is actually a packing process. As noted above, a collection limited to needed data segments from memory is described by a single descriptor in uplink packets—e.g., a single header or control is used to pack together all data segments into a single packet. In different embodiments, packed data can be sent over the network, and unpacked in a remote node to be distributed to a remote memory, e.g., by the scatter operation described above.
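The packing and unpacking described above can be sketched as a pair of inverse operations. This is an illustrative model only: memory is represented by a Python bytearray, segments by hypothetical (address, length) tuples, and the names gather and scatter are not taken from the disclosure.

```python
def gather(memory: bytearray, segments: list[tuple[int, int]]) -> bytes:
    """Pack the (address, length) segments of memory into one contiguous payload."""
    return b"".join(bytes(memory[addr:addr + length]) for addr, length in segments)


def scatter(memory: bytearray, segments: list[tuple[int, int]], payload: bytes) -> None:
    """Unpack a contiguous payload into the (address, length) segments of memory."""
    offset = 0
    for addr, length in segments:
        memory[addr:addr + length] = payload[offset:offset + length]
        offset += length


# Local node: only the needed segments are packed; the "x" bytes are never read.
local = bytearray(b"AAxxBBBxxxC")
packed = gather(local, [(0, 2), (4, 3), (10, 1)])

# Remote node: the same payload is distributed to a different layout.
remote = bytearray(12)
scatter(remote, [(1, 2), (5, 3), (11, 1)], packed)
print(packed)  # b'AABBBC'
```

The payload carries no per-segment header; the segment boundaries are recovered entirely from the descriptor on each side.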
Then, reverting to
Referring again to
If the data is to be gathered conventionally, a standard uplink request, e.g., a PCIe non-posted read request, is transmitted to the memory controller 32, indicated by arrow 80. However, if the data is to be consolidated into multiple read requests, as described with reference to
For example, the descriptor control block 78 may invoke the upload descriptor builder 82 when the gather list in the WQE contains more than a threshold number of separate addresses and the data to be read from each address is smaller than a threshold size. The decision parameters can be configurable in each system for performance tuning.
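The decision rule in this example can be sketched as a simple predicate. The function name, the parameter names, and the threshold values below are hypothetical; in practice the thresholds would be the configurable tuning parameters mentioned above.

```python
def use_offload(gather_list: list[tuple[int, int]],
                min_entries: int, max_segment_bytes: int) -> bool:
    """Return True when a gather list of (address, length) pairs should be offloaded.

    The list qualifies when it holds more than min_entries separate addresses
    and every segment is smaller than max_segment_bytes.
    """
    many_addresses = len(gather_list) > min_entries
    all_small = all(length < max_segment_bytes for _, length in gather_list)
    return many_addresses and all_small


# Assumed thresholds: more than 4 entries, each smaller than 64 bytes.
small_chunks = [(0x1000 + 0x100 * i, 8) for i in range(16)]
large_chunks = [(0x1000, 4096), (0x8000, 4096)]
print(use_offload(small_chunks, min_entries=4, max_segment_bytes=64))  # True
print(use_offload(large_chunks, min_entries=4, max_segment_bytes=64))  # False
```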
In the example shown in Table 2, a descriptor contains a list of pointers to memory. The parameters are:
Address: Memory address of data.
Length: Size of the data to be gathered or scattered.
Last: If set, the last pointer in the descriptor. If clear, the next segment is another pointer.
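A pointer-list descriptor with these three parameters could be walked as sketched below. The class PointerEntry and the function walk_pointer_descriptor are hypothetical names introduced for illustration, and the in-memory representation is an assumption; the actual field layout is the one given in Table 2.

```python
from dataclasses import dataclass


@dataclass
class PointerEntry:
    address: int   # memory address of the data
    length: int    # size of the data to be gathered or scattered
    last: bool     # set on the final pointer of the descriptor


def walk_pointer_descriptor(entries):
    """Yield (address, length) for each entry, stopping after the one marked Last."""
    for entry in entries:
        yield entry.address, entry.length
        if entry.last:
            return
    raise ValueError("descriptor ended without a Last entry")


entries = [
    PointerEntry(0x1000, 8, last=False),
    PointerEntry(0x2040, 16, last=True),
    PointerEntry(0x3000, 4, last=False),   # beyond Last: not part of the descriptor
]
segments = list(walk_pointer_descriptor(entries))
print(segments)  # [(4096, 8), (8256, 16)]
```

Because the Last flag terminates the list, the descriptor needs no separate entry count.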
In the example shown in Table 3, a descriptor specifies data access according to a stride. The parameters are:
Address: Base address in memory.
Total Length: Total length of the data to be scattered or gathered.
Length: Size of each data segment.
Stride: Offset between successive data segments to be accessed.
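The expansion of a stride descriptor into individual accesses can be sketched as follows. The function name expand_stride is hypothetical, and the sketch assumes that Total Length counts data bytes only, so that the number of segments is Total Length divided by Length.

```python
def expand_stride(address: int, total_length: int, length: int,
                  stride: int) -> list[tuple[int, int]]:
    """Return the (address, length) of every segment named by a stride descriptor."""
    if length <= 0 or total_length % length != 0:
        raise ValueError("total_length must be a positive multiple of length")
    count = total_length // length
    return [(address + i * stride, length) for i in range(count)]


# Four 8-byte accesses spaced 1 MB apart, starting at an assumed base of 0x1000.
segments = expand_stride(address=0x1000, total_length=32, length=8, stride=1 << 20)
for addr, size in segments:
    print(hex(addr), size)
```

Four fields thus describe any number of equally spaced segments, whereas a pointer list grows by one entry per segment.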
In the example shown in Table 4, the descriptor is a pattern that combines pointers and data accessed according to a stride. The descriptor can repeat a pattern multiple times.
The parameters are:
Address: Base address in memory.
Total Length: Total length to scatter or gather.
Stride: Offset between successive writes.
Length: Size of every data segment. (The stride is the offset between these segments.)
Last: If set, the last pointer in the descriptor. If clear, the next segment is another pointer.
Multiple: Number of times to repeat the pattern (from the start of the descriptor).
Multiple Stride: Offset between successive instances of the pattern.
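One possible reading of the pattern descriptor is sketched below: the pattern is a list of strided entries, and the whole pattern is replayed Multiple times, shifted by Multiple Stride on each repetition. The function name expand_pattern and the tuple representation of an entry are hypothetical, and this interpretation of how the fields combine is an assumption rather than a statement of the layout in Table 4.

```python
def expand_pattern(entries: list[tuple[int, int, int, int]],
                   multiple: int, multiple_stride: int) -> list[tuple[int, int]]:
    """Expand a pattern descriptor into (address, length) segments.

    Each entry is (address, total_length, length, stride); the list of
    entries forms one pattern, repeated `multiple` times.
    """
    segments = []
    for repeat in range(multiple):
        shift = repeat * multiple_stride   # offset of this instance of the pattern
        for address, total_length, length, stride in entries:
            for i in range(total_length // length):
                segments.append((address + shift + i * stride, length))
    return segments


# Pattern: two 4-byte segments 16 bytes apart, plus one 4-byte segment at 100.
pattern = [(0, 8, 4, 16), (100, 4, 4, 0)]
segments = expand_pattern(pattern, multiple=2, multiple_stride=1000)
print(segments)
# [(0, 4), (16, 4), (100, 4), (1000, 4), (1016, 4), (1100, 4)]
```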
PCIe Implementation.
A packet format for a PCIe implementation for efficient scatter-gather operations is shown in
Implementation Details.
For the case of DMA, in a Strided Gather Transaction (SGT) the read is encapsulated in a PCIe Vendor Defined Message (VDM), which travels along the PCIe fabric using posted credits (since VDMs are posted). However, when it arrives at the destination, it is treated as non-posted. It might happen that upon arrival of such a request there are no resources at the receiver side to handle a non-posted request. This might result in blocking of subsequent posted requests and eventually might result in deadlock. To avoid this situation several approaches are suggested:
1) Provide a separate buffer for SGTs, referred to as the SGTB (Strided Gather Transactions Buffer). The number of its entries (each of the size of the maximum SGT entry) should be equal to the number of posted credits that the port advertises. A posted credit is released by the receiver upon handling a standard posted packet or upon popping an SGT entry from the SGTB.
2) If responses for an SGT can always be treated as completions with the relaxed ordering bit set (no need to require ordering between SGT completions and writes from the host to the device), then forward progress of the SGT will not depend on outbound posted progress. Therefore, deadlock is prevented (deadlock in which an inbound posted message depends on an outbound posted message). In addition, the receiver may implement an ability for posted transactions to bypass VDMs of SGTB type in order to prevent possible blocking or deadlock.
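The credit accounting of the first approach can be illustrated with a toy model. The class Receiver and its method names are hypothetical, and the model is a sketch of the bookkeeping only: it does not represent PCIe flow control in detail. It shows why an SGTB with one entry per advertised posted credit cannot overflow, since every buffered SGT holds exactly one credit until it is popped.

```python
from collections import deque


class Receiver:
    """Toy receiver with posted credits and a Strided Gather Transactions Buffer."""

    def __init__(self, posted_credits: int):
        self.credits = posted_credits        # credits currently available to the sender
        self.sgtb = deque()                  # buffered SGTs awaiting non-posted handling
        self.sgtb_capacity = posted_credits  # one SGTB entry per advertised credit

    def _consume_credit(self) -> None:
        if self.credits == 0:
            raise RuntimeError("sender has no posted credits")
        self.credits -= 1

    def receive_posted(self, packet) -> None:
        """A standard posted packet is handled at once, releasing its credit."""
        self._consume_credit()
        self.credits += 1

    def receive_sgt(self, sgt) -> None:
        """An SGT arrives on posted credits and is parked in the SGTB."""
        self._consume_credit()
        if len(self.sgtb) >= self.sgtb_capacity:
            raise OverflowError("SGTB full")   # unreachable while the sizing rule holds
        self.sgtb.append(sgt)

    def pop_sgt(self):
        """Popping an SGT entry from the SGTB releases its credit."""
        sgt = self.sgtb.popleft()
        self.credits += 1
        return sgt


rx = Receiver(posted_credits=2)
rx.receive_sgt("sgt-a")
rx.receive_sgt("sgt-b")          # SGTB is now full and the sender holds no credits
first = rx.pop_sgt()             # frees one SGTB entry and one credit
rx.receive_posted("write")       # posted traffic can proceed again
```

In this model the sum of available credits and occupied SGTB entries is constant, so the sender runs out of credits before the buffer can overflow.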
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof that are not in the prior art, which would occur to persons skilled in the art upon reading the foregoing description.
This Application claims the benefit of U.S. Provisional Application Nos. 62/585,567, filed Nov. 14, 2017, and 62/595,605, filed Dec. 7, 2017, which are herein incorporated by reference.
Number | Name | Date | Kind |
---|---|---|---|
8255475 | Kagan et al. | Aug 2012 | B2 |
8645663 | Kagan et al. | Feb 2014 | B2 |
9456060 | Pope et al. | Sep 2016 | B2 |
20020165897 | Kagan | Nov 2002 | A1 |
20030120835 | Kale | Jun 2003 | A1 |
20040030745 | Boucher | Feb 2004 | A1 |
20050223118 | Tucker | Oct 2005 | A1 |
20070127525 | Sarangam | Jun 2007 | A1 |
20080219159 | Chateau | Sep 2008 | A1 |
20090240838 | Berg et al. | Sep 2009 | A1 |
20090296699 | Hefty | Dec 2009 | A1 |
20090327444 | Archer et al. | Dec 2009 | A1 |
20100274876 | Kagan et al. | Oct 2010 | A1 |
20100329275 | Johnsen et al. | Dec 2010 | A1 |
20130215904 | Zhou | Aug 2013 | A1 |
20130312011 | Kumar et al. | Nov 2013 | A1 |
20140040542 | Kim et al. | Feb 2014 | A1 |
20140136811 | Fleischer | May 2014 | A1 |
20150074373 | Sperber et al. | Mar 2015 | A1 |
20150193271 | Archer et al. | Jul 2015 | A1 |
20150261720 | Kagan et al. | Sep 2015 | A1 |
20150347012 | Dewitt et al. | Dec 2015 | A1 |
20160119244 | Wang et al. | Apr 2016 | A1 |
20160162188 | Padia | Jun 2016 | A1 |
20160283422 | Crupnicoff et al. | Sep 2016 | A1 |
20170192782 | Valentine et al. | Jul 2017 | A1 |
20170235702 | Horie | Aug 2017 | A1 |
20170308329 | A et al. | Oct 2017 | A1 |
20180052803 | Graham et al. | Feb 2018 | A1 |
Entry |
---|
Bruck et al., “Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 8, No. 11, pp. 1143-1156, Nov. 1997. |
Gainaru et al., “Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All”, EuroMPI '16, Edinburgh, United Kingdom, 13 pages, year 2016. |
MPI: A Message-Passing Interface Standard, Version 3.1, Message Passing Interface Forum, 868 pages, Jun. 4, 2015. |
Shattah et al, U.S. Appl. No. 15/996,548, filed Jun. 4, 2018. |
U.S. Appl. No. 15/681,390 office action dated Nov. 9, 2018. |
U.S. Appl. No. 15/681,390 office action dated May 14, 2019. |
Number | Date | Country | |
---|---|---|---|
20190149486 A1 | May 2019 | US |
Number | Date | Country | |
---|---|---|---|
62585567 | Nov 2017 | US | |
62595605 | Dec 2017 | US |