The present invention relates generally to data storage, and particularly to emulation of storage protocols in peripheral devices.
Various techniques for data storage using network adapters are known in the art. For example, U.S. Pat. Nos. 9,696,942 and 9,727,503 describe techniques for accessing remote storage devices using a local bus protocol. A disclosed method includes configuring a driver program on a host computer to receive commands in accordance with a protocol defined for accessing local storage devices connected to a peripheral component interface bus of the host computer. When the driver program receives, from an application program running on the host computer, a storage access command in accordance with the protocol, specifying a storage transaction, a remote direct memory access (RDMA) operation is performed by a network interface controller (NIC) connected to the host computer so as to execute the storage transaction via a network on a remote storage device.
U.S. Pat. No. 10,657,077 describes a HyperConverged NVMF storage-NIC card. A storage and communication apparatus for plugging into a server includes a circuit board, a bus interface, a Medium Access Control (MAC) processor, one or more storage devices and at least one Central Processing Unit (CPU). The bus interface is configured to connect the apparatus at least to a processor of the server. The MAC processor is mounted on the circuit board and is configured to connect to a communication network. The storage devices are mounted on the circuit board and are configured to store data. The CPU is mounted on the circuit board and is configured to expose the storage devices both (i) to the processor of the server via the bus interface, and (ii) indirectly to other servers over the communication network.
An embodiment of the present invention that is described herein provides a peripheral device including a host interface and processing circuitry. The host interface is to communicate with one or more hosts over a peripheral bus. The processing circuitry is to expose on the peripheral bus a peripheral-bus device that communicates with the one or more hosts using one or more instances of at least one bus storage protocol, to receive, using the exposed peripheral-bus device, Input/Output (I/O) transactions that are issued by the one or more hosts, and to complete the I/O transactions for the one or more hosts in accordance with one or more instances of at least one network storage protocol, by running at least part of a host-side protocol stack of the at least one network storage protocol.
In some embodiments, the processing circuitry is to expose multiple separate File Systems (FSs) to the one or more hosts, using the peripheral-bus device. In an example embodiment, the processing circuitry is to deduplicate data across at least two of the separate FSs. In another embodiment, the processing circuitry is to cache data for at least two of the separate FSs, in accordance with a caching policy that depends on usage of the data across the at least two of the separate FSs.
In disclosed embodiments, the processing circuitry is to complete the I/O transactions by storing data in a plurality of storage tiers. In an embodiment, the processing circuitry is to move at least part of the data among the storage tiers depending on usage of the data by the one or more hosts.
In some embodiments, the processing circuitry is to (i) receive, via the peripheral-bus device, a request from a host to send, over a network to a remote host, specified data that was previously stored by the peripheral device in accordance with the network storage protocol, and (ii) in response to the request, transfer the previously-stored data over the network to the remote host, thereby offloading the transfer of the data from the host.
In an embodiment, the processing circuitry is to transfer the specified data to the remote host by (i) fetching the data from a storage location in which the data was previously stored, and (ii) sending the fetched data to the remote host. In an embodiment, in transferring the previously-stored data, the previously-stored data is not transferred via the host. In another embodiment, the processing circuitry is to instruct a peer peripheral device, over the network, to send the previously-stored data to the remote host. In an example embodiment, in transferring the previously-stored data, the previously-stored data is not transferred via the peripheral device.
There is additionally provided, in accordance with an embodiment, a method including, in a peripheral device, communicating with one or more hosts over a peripheral bus. A peripheral-bus device, which communicates with the one or more hosts using one or more instances of at least one bus storage protocol, is exposed on the peripheral bus using the peripheral device. Input/Output (I/O) transactions, which are issued by the one or more hosts, are received in the peripheral device using the exposed peripheral-bus device. The I/O transactions are completed for the one or more hosts by the peripheral device, in accordance with one or more instances of at least one network storage protocol, by running at least part of a host-side protocol stack of the at least one network storage protocol.
There is also provided, in accordance with an embodiment, a method for emulating a storage protocol in a peripheral device. The method includes, using a peripheral device that is connected to one or more hosts by a peripheral bus, exposing on the peripheral bus a dedicated peripheral-bus device that communicates with the hosts using at least one bus storage protocol. Input/Output (I/O) transactions, which are issued by the hosts, are received in the peripheral device using the exposed peripheral-bus device. The I/O transactions are completed for the hosts, by the peripheral device, in accordance with at least one network storage protocol, by running at least part of a host-side protocol stack of the at least one network storage protocol.
Another embodiment of the present invention that is described herein provides a peripheral device including a host interface and processing circuitry. The host interface is configured to communicate with a host over a peripheral bus. The processing circuitry is configured to expose on the peripheral bus a peripheral-bus device that communicates with the host using a bus storage protocol, to receive, using the exposed peripheral-bus device, Input/Output (I/O) transactions that are issued by the host, and to complete the I/O transactions for the host in accordance with a network storage protocol, by running at least part of a host-side protocol stack of the network storage protocol.
In some embodiments, in running at least part of the host-side protocol stack, the processing circuitry is configured to isolate the host from control-plane operations of the network storage protocol. In an embodiment, the processing circuitry is configured to complete at least some, or at least part, of the I/O transactions for the host in a local storage. In another embodiment, the peripheral device further includes a network port configured to communicate over a network, and the processing circuitry is configured to complete at least some, or at least part, of the I/O transactions for the host by communicating over the network with a storage system that operates in accordance with the network storage protocol.
In some embodiments, in completing an I/O transaction over the network, the processing circuitry is configured to transfer data directly between a memory of the host and the storage system using zero-copy transfer. In an example embodiment, the processing circuitry is configured to determine one or more addresses for the data in the storage system, and then to transfer the data directly between the one or more addresses and the memory of the host, without intermediate storage of the data in the peripheral device.
In a disclosed embodiment, at least one of the bus storage protocol and the network storage protocol is a block storage protocol. In another embodiment, at least one of the bus storage protocol and the network storage protocol is a File-System (FS) protocol. In yet another embodiment, at least one of the bus storage protocol and the network storage protocol is an object storage protocol. In still another embodiment, at least one of the bus storage protocol and the network storage protocol is a Key-Value (KV) protocol.
In an embodiment, in exposing the peripheral-bus device, the processing circuitry is configured to emulate a hot-plug indication, notifying the host that a storage device has connected to the peripheral bus. In some embodiments, the processing circuitry is configured to receive from the host a doorbell indicative of a queue on which the host posted one or more work requests pertaining to an I/O transaction, and to read and execute the one or more work requests so as to complete the I/O transaction. In an example embodiment, the processing circuitry includes hardware that is configured to receive the doorbell and to read the one or more work requests from the queue in response to the doorbell.
In another embodiment, the processing circuitry is configured to issue a Message Signaled Interrupt to the host upon completing an I/O transaction. In a disclosed embodiment, the processing circuitry is configured to communicate with the host via one or more registers exposed on the peripheral bus. In another embodiment, the processing circuitry is further configured to perform one or more of storage virtualization and data manipulation operations.
There is additionally provided, in accordance with an embodiment of the present invention, a method including, in a peripheral device that communicates with a host over a peripheral bus, exposing on the peripheral bus a peripheral-bus device that communicates with the host using a bus storage protocol.
Input/Output (I/O) transactions, which are issued by the host, are received in the peripheral device using the exposed peripheral-bus device. The I/O transactions are completed for the host, by the peripheral device, in accordance with a network storage protocol, by running at least part of a host-side protocol stack of the network storage protocol.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
Embodiments of the present invention that are described herein provide improved data storage techniques in which a peripheral device provides storage services to a host over a peripheral bus. The host may comprise, for example, a server in a data center. The peripheral device may comprise, for example, a high-performance network adapter, sometimes referred to as a Data Processing Unit (DPU) or “Smart-NIC”. The embodiments described herein refer mainly to a DPU that provides storage services to a host over a Peripheral Component Interconnect express (PCIe) bus. Generally, however, the disclosed techniques are applicable to various other types of peripherals and buses.
In the disclosed embodiments, the peripheral device serves the host using a network storage protocol, e.g., a block storage protocol, a File-System (FS) protocol, an object storage protocol or a Key-Value (KV) storage protocol. In particular, the peripheral device (i) exposes to the host a dedicated PCIe device that emulates a bus storage protocol, and (ii) runs at least part of the host-side protocol stack of the network storage protocol. The peripheral device receives Input/Output (I/O) transactions that are issued by the host, and completes the I/O transactions for the host, in accordance with the network storage protocol, using the internally-run protocol stack.
In the present context, the phrase “exposing a peripheral-bus device on the peripheral bus” means running an interface that is emulated toward the host to appear as a different peripheral-bus device, for the host to communicate with. The peripheral-bus device is typically a network-type device (as opposed, for example, to a storage-type device).
When using the disclosed techniques, the host is completely isolated from the control plane (also referred to as management plane or orchestration plane) of the storage service. The dedicated PCIe device presents to the host a storage interface, which is by nature more specific and restricted than a network interface. The host's interaction with the storage service is confined to data-plane storage operations, i.e., to exchanging I/O transactions with the dedicated PCIe device. Communication between the host and the dedicated PCIe device is typically implemented using a limited set of commands and virtually no security privileges. Control and management operations relating to storage services, for example login, management of identities, credentials and access privileges and other security-related operations, are carried out between the peripheral device (e.g., DPU) and any relevant (remote and/or local) storage system. The host, and therefore any untrusted software that might run on it, is completely isolated from these operations.
In some embodiments, being exposed only to the bus storage protocol, the host may be unaware of the type of network storage protocol used by the peripheral device. As such, it is even possible for the bus storage protocol and the network storage protocol to be of different types. For example, the dedicated PCIe device may receive I/O transactions from the host in accordance with a File-System protocol, and complete the I/O transactions over the network in accordance with an object or Key-Value protocol.
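By way of illustration only, the following Python sketch shows one way such a translation could look, mapping file-oriented writes and reads onto put/get operations of an object store. The `ObjectStoreClient` class, the chunk-per-object layout and the key-naming scheme are assumptions made for the example and are not part of the embodiments described herein.

```python
# Illustrative sketch: translating bus-level file I/O into object-store operations.
# "ObjectStoreClient" and its get/put methods are hypothetical placeholders.

class ObjectStoreClient:
    """Stand-in for a network object/KV storage backend."""
    def __init__(self):
        self._objects = {}

    def put(self, key: str, value: bytes) -> None:
        self._objects[key] = value

    def get(self, key: str) -> bytes:
        return self._objects.get(key, b"")

class FsToObjectTranslator:
    """Completes file-system style I/O transactions against an object store."""
    def __init__(self, backend: ObjectStoreClient, chunk_size: int = 4096):
        self.backend = backend
        self.chunk_size = chunk_size

    def _key(self, path: str, chunk_index: int) -> str:
        # One object per fixed-size file chunk (illustrative naming scheme).
        return f"{path}#{chunk_index}"

    def write(self, path: str, offset: int, data: bytes) -> int:
        # Assumes chunk-aligned writes for brevity.
        chunk = offset // self.chunk_size
        for i in range(0, len(data), self.chunk_size):
            self.backend.put(self._key(path, chunk), data[i:i + self.chunk_size])
            chunk += 1
        return len(data)

    def read(self, path: str, offset: int, length: int) -> bytes:
        chunk = offset // self.chunk_size
        out = b""
        while len(out) < length:
            piece = self.backend.get(self._key(path, chunk))
            if not piece:
                break  # hole or end of data
            out += piece
            chunk += 1
        return out[:length]

# Example: a host-issued file write is completed as object-store puts.
translator = FsToObjectTranslator(ObjectStoreClient())
translator.write("/data/report.txt", 0, b"hello")
assert translator.read("/data/report.txt", 0, 5) == b"hello"
```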
As is evident from the description above, the disclosed architecture provides a high degree of security to the storage service. Isolating the host from the management and control of the storage service is important in many applications and use-cases. One example is a multi-tenant cloud application, in which the host does not always have control over the different applications it runs. Another example is a “bare metal” cloud application, in which a tenant is provided with full access privileges to the host. In such scenarios, the disclosed technique enables a storage provider to provide storage services to the various applications running on the host, in a well-controlled, secure and mutually-isolated manner.
The disclosed technique also improves performance, since the host is relieved of most, if not all, of the network storage protocol stack. Storage tasks often exhibit unpredictable bursts of computational load, e.g., due to complex operations such as manipulation of metadata structures, garbage collection, data compaction and defragmentation. Some computational tasks that may be carried out by the host, e.g., some High-Performance Computing (HPC) workloads, are sensitive to such variations in computational load. Offloading the host-side protocol stack to a peripheral device is therefore highly advantageous.
Moreover, when using the disclosed techniques, maintenance and administration of the network storage protocol stack (e.g., installation, upgrade and configuration) are performed entirely within the peripheral device (e.g., DPU). No cooperation or awareness is required from the host or the host administrator in performing such actions.
Several example implementations and use-cases of the disclosed techniques are described herein. Complementary techniques, such as zero-copy completion of I/O transactions and special-purpose doorbell mechanisms, are also described.
Additional disclosed embodiments provide support for multiple separate filesystems, e.g., for use in multi-tenant hosts. Yet other embodiments provide multi-tier storage in a manner that is transparent to the host, and acceleration and offloading of file-transfer commands such as ‘sendfile’.
In the present context, DPU 24 is regarded as a peripheral device connected to PCIe bus 36. DPU 24 provides host 28 with data storage services, possibly among other tasks. In the example of
Host 28 comprises a host CPU 40 that may run various software applications depending on the applicable use-case. In one embodiment, host 28 comprises a server in a cloud-based data center, which hosts applications belonging to multiple customers (“tenants”). In another embodiment, host 28 comprises a server in a “bare metal” data center, in which a tenant “owns” the server, in the sense that the tenant is given full access privileges to the server.
Among other functions, the applications running on host CPU 40 issue Input/Output (I/O) transactions, e.g., transactions that write data to files, read data from files, or create, modify or delete files or directories. Generally, I/O transactions can be issued by any software that runs on host CPU 40, e.g., by Virtual Machines (VMs), processes, containers, or any other software.
In the present example, DPU 24 comprises a host interface 44 for communicating with host 28 over PCIe bus 36, a network port 48 for communicating with FS 32 over the network (e.g., using Ethernet packets), and processing circuitry 52 for carrying out the various networking and storage functions of the DPU. Processing circuitry 52 typically comprises one or more CPUs 56 that run suitable software, and dedicated hardware 60. The tasks of processing circuitry 52 may be partitioned between software (CPUs 56) and hardware (dedicated hardware 60) in any suitable way.
In some embodiments, processing circuitry 52 provides storage services to host CPU 40 by running at least part of the host-side protocol stack of the network storage protocol of FS 32. In addition, processing circuitry 52 exposes to host CPU 40 a dedicated PCIe device 62 on PCIe bus 36. In some embodiments, processing circuitry 52 may perform additional processing that enhances the specified network storage protocol of FS 32. For example, if the network storage protocol does not provide cryptographic capabilities, processing circuitry 52 of DPU 24 may add this functionality on top of the network storage protocol.
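As a sketch of how such an added capability might be layered on top of a network storage protocol that lacks it, the snippet below wraps a placeholder backend's read/write path with authenticated encryption performed on the DPU side. It relies on the third-party Python `cryptography` package, and the `Backend` class is a hypothetical stand-in rather than an actual protocol client.

```python
# Illustrative sketch: DPU-side encryption layered over a storage backend
# that has no cryptographic capabilities of its own.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

class Backend:
    """Stand-in for the network storage protocol client (stores ciphertext only)."""
    def __init__(self):
        self._store = {}

    def write(self, key, blob):
        self._store[key] = blob

    def read(self, key):
        return self._store[key]

class EncryptingBackend:
    """Encrypts data before it leaves the DPU and decrypts on the way back."""
    def __init__(self, inner: Backend, key: bytes):
        self.inner = inner
        self.aead = AESGCM(key)

    def write(self, name: str, data: bytes) -> None:
        nonce = os.urandom(12)
        # Store the nonce alongside the ciphertext; bind the object name as associated data.
        self.inner.write(name, nonce + self.aead.encrypt(nonce, data, name.encode()))

    def read(self, name: str) -> bytes:
        blob = self.inner.read(name)
        nonce, ciphertext = blob[:12], blob[12:]
        return self.aead.decrypt(nonce, ciphertext, name.encode())

backend = EncryptingBackend(Backend(), AESGCM.generate_key(bit_length=256))
backend.write("volume0/block42", b"sensitive payload")
assert backend.read("volume0/block42") == b"sensitive payload"
```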
For the sake of clarity, PCIe device 62 is depicted in the figure inside host 28, in order to emphasize the interaction between device 62 and host CPU 40. In reality, however, PCIe device 62 is a logical interface presented to host 28 by DPU 24 over bus 36. The terms “PCIe device” and “PCIe interface” can therefore be used interchangeably. PCIe device 62 may comprise a PCIe physical function or virtual function.
PCIe device 62 is configured to emulate a bus storage protocol vis-à-vis the host CPU. Host CPU 40 conducts the I/O transactions by communicating with PCIe device 62 using the bus storage protocol. Processing circuitry 52 of DPU 24 completes (i.e., executes) the I/O transactions for host CPU 40 in FS 32 (and/or in local storage as elaborated below), using the internally-run protocol stack of the network storage protocol. Host interaction with PCIe device 62 may be implemented using standard operating-system (OS) drivers or a vendor-specific driver, as appropriate.
The protocol between PCIe device 62 and host CPU 40 is typically limited to a small dedicated set of storage-related commands, as opposed to the arbitrary communication enabled by conventional network devices. Therefore, the security vulnerability of this protocol is considerably reduced, and the task of securing it is significantly simpler. For example, processing circuitry 52 in DPU 24 may analyze the transactions arriving via PCIe device 62 and apply a security policy that is specified per the storage protocol being used. The security policy may examine attributes relating to the storage protocol (e.g., filenames, offsets, object identifiers and the like) and take actions depending on the attribute values. Actions may comprise, for example, permitting or denying access to certain files or objects, or any other suitable action.
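The following minimal sketch illustrates what such a protocol-aware policy check might look like; the rule structure, glob-style patterns and deny-by-default behavior are illustrative assumptions only.

```python
# Illustrative sketch: applying a storage-protocol-aware security policy
# to I/O transactions arriving via the emulated PCIe device.
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class IoTransaction:
    op: str          # "read" or "write"
    path: str        # filename / object identifier
    offset: int
    length: int

@dataclass
class Rule:
    pattern: str     # glob over filenames/object identifiers
    allowed_ops: tuple
    max_length: int

POLICY = [
    Rule(pattern="/public/*", allowed_ops=("read",), max_length=1 << 20),
    Rule(pattern="/tenant-a/*", allowed_ops=("read", "write"), max_length=1 << 26),
]

def check(txn: IoTransaction) -> bool:
    """Return True if the transaction may proceed; deny by default."""
    for rule in POLICY:
        if fnmatch(txn.path, rule.pattern):
            return txn.op in rule.allowed_ops and txn.length <= rule.max_length
    return False

assert check(IoTransaction("read", "/public/readme.txt", 0, 4096))
assert not check(IoTransaction("write", "/public/readme.txt", 0, 4096))
```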
In some embodiments, DPU 24 further comprises local storage 64, e.g., one or more Flash memory devices. In some embodiments, system 20 further comprises an additional peripheral device 68 that comprises local storage 72, e.g., one or more Flash memory devices. The additional peripheral device may be, for example, another NIC (DPU or otherwise) or a Solid State Drive (SSD). In some embodiments, completion of I/O transactions may involve storing data in local storage, e.g., storage 64 in DPU 24 or storage 72 in additional peripheral device 68.
Generally, DPU 24 may complete at least some, or at least part, of the I/O transactions over the network, and may complete at least some, or at least part, of the I/O transactions in the local storage. Thus, for example, the protocol stack running in the DPU may translate a given I/O transaction into multiple storage operations (read or write) of the network storage protocol. One or more of these storage operations may be performed over the network, and one or more of the storage operations may be performed in the local storage.
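A minimal sketch of this fan-out is given below; the placement rule (a fixed set of stripes kept in local storage, the remainder sent to the network target) is an arbitrary assumption chosen only to make the example concrete.

```python
# Illustrative sketch: the DPU-side stack translating one host I/O transaction
# into several backend operations, some local and some over the network.
def split_write(path, offset, data, stripe=8192, local_stripes=(0,)):
    """Yield (target, path, offset, chunk) tuples for a single host write.

    Stripes listed in 'local_stripes' go to DPU-local storage; the rest go
    to the remote storage system (an arbitrary illustrative placement rule).
    """
    for i in range(0, len(data), stripe):
        stripe_index = (offset + i) // stripe
        target = "local" if stripe_index in local_stripes else "network"
        yield target, path, offset + i, data[i:i + stripe]

ops = list(split_write("/vol/db.dat", 0, bytes(3 * 8192)))
assert [t for t, *_ in ops] == ["local", "network", "network"]
```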
The configurations of system 20 and its components, e.g., DPU 24 and host 28, shown in
For example, in alternative embodiments, the peripheral device that exposes PCIe device 62 and runs the host-side network storage protocol stack may have no network connection at all. In an example embodiment of this sort, the peripheral device is a storage device such as an SSD. In such embodiments, data storage for the host is performed locally with no network communication. Further alternatively, the peripheral device may perform both local storage and remote storage.
Moreover, the disclosed techniques are not limited to file-system protocols, nor to any other particular type of storage protocol. In alternative embodiments, the bus storage protocol may comprise various other types of storage protocols. Example bus storage protocols include block-storage (“block device”) protocols such as NVMe, virtio-blk, SCSI, SATA and SAS, various object storage protocols, KV storage protocols such as NVMe-KV, or any other suitable protocol. As an alternative to FS 32, DPU 24 may complete I/O transactions using various network storage protocols, e.g., block-storage protocols such as NVMe-over-Fabrics, NVMe-over-TCP, iSCSI, iSER, SRP and Fibre Channel, object storage protocols such as Amazon S3, Microsoft Azure, OpenStack Swift and Google Cloud Storage, KV storage protocols such as NoSQL, Redis and RocksDB, or any other suitable storage system or protocol.
As noted above, it is not mandatory that the bus storage protocol (exposed toward the host) and the network storage protocol (used for transaction completion) be of the same type. For example, in an embodiment, the bus storage protocol may comprise a FS protocol, while the network storage protocol comprises an object or KV protocol. Any other suitable combination can also be used.
The various elements of system 20 and its components, e.g., DPU 24 and host 28, may be implemented using software, using suitable hardware such as one or more Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs), or using a combination of software and hardware elements.
Typically, host CPU 40 and CPUs 56 of DPU 24 comprise programmable processors, which are programmed in software to carry out the functions described herein. The software may be downloaded to any of the processors in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
In the example of
In an embodiment, Virtio-fs driver 88 presents to host 28 a local directory to mount, which is mapped to a directory in DPU 24. The protocol between DPU 24 and Virtio-fs driver 88 is defined in the virtio-fs specification, cited above, and is based on FUSE commands delivered over virtio-queues. In the more general case, virtio-fs SNAP controller 80 and file system driver 84 are tightly coupled to one another, and the virtio-fs folder presented to the host originates directly from the network without a local DPU folder representing it.
At an I/O execution step 98, processing circuitry 52 of DPU 24 executes the I/O transactions for the host, in the appropriate storage system, in accordance with the network storage protocol. At a completion step 102, when the I/O transaction is completed, processing circuitry 52 of DPU 24 sends a completion notification to host 28.
As can be appreciated, completing an I/O transaction by DPU 24 involves transfer of data between the memory of host 28 and a memory of the storage system managed by FS 32. When completing a write command, for example, processing circuitry 52 of DPU 24 transfers data from the memory of host 28 to the memory of the storage system. When completing a read command, processing circuitry 52 transfers data in the opposite direction, from the memory of the storage system to the memory of host 28.
In some embodiments, processing circuitry 52 performs these data transfers in a “zero-copy” manner. In the present context, the term “zero-copy” means that the data is transferred directly between the memory of the host and the memory of the storage system, without intermediate storage in DPU 24. Zero-copy completion of I/O transactions significantly reduces the overall transaction latency and increases the achievable throughput.
Typically, the data transfer is performed using Remote Direct Memory Access (RDMA). In some embodiments, processing circuitry 52 performs zero-copy data transfer in two stages. In the first stage, processing circuitry 52 determines the appropriate address or addresses in the storage system for completing the I/O transaction (the address or addresses to which the data is to be written in case of a write command, or from which the data is to be read in case of a read command). Only then, in the second stage, processing circuitry 52 transfers the data between the appropriate addresses in the memory of host 28 and in the memory of the storage system.
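The two-stage flow can be summarized by the following sketch. The stub classes and the `resolve_storage_addresses`, `rdma_write` and `rdma_read` methods are hypothetical stand-ins for the DPU's network-storage protocol stack and RDMA engine, not actual APIs.

```python
# Illustrative sketch of the two-stage, zero-copy completion flow.
# The stub classes stand in for the DPU's protocol stack and RDMA engine.
from dataclasses import dataclass, field

class ProtocolStack:
    def resolve_storage_addresses(self, path, offset, length, op):
        # Stage 1: ask the network storage protocol where the data lives
        # (read) or should be placed (write). Dummy addresses for illustration.
        return [("storage-node-1", offset + i) for i in range(0, length, 4096)]

class RdmaEngine:
    def rdma_write(self, src, dst):
        print(f"RDMA WRITE host{src} -> {dst}")

    def rdma_read(self, src, dst):
        print(f"RDMA READ {src} -> host{dst}")

@dataclass
class Txn:
    op: str
    path: str
    offset: int
    length: int
    host_segments: list = field(default_factory=list)

def complete_zero_copy(txn, stack: ProtocolStack, rdma: RdmaEngine):
    # Stage 1: address resolution via the network storage protocol.
    remote = stack.resolve_storage_addresses(txn.path, txn.offset, txn.length, txn.op)
    # Stage 2: direct host-memory <-> storage-system transfer; the data is
    # never staged in DPU memory.
    for host_seg, remote_addr in zip(txn.host_segments, remote):
        if txn.op == "write":
            rdma.rdma_write(src=host_seg, dst=remote_addr)
        else:
            rdma.rdma_read(src=remote_addr, dst=host_seg)

complete_zero_copy(Txn("write", "/vol/a", 0, 8192, [(0x1000, 4096), (0x2000, 4096)]),
                   ProtocolStack(), RdmaEngine())
```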
In some embodiments, processing circuitry 52 performs zero-copy data transfer by accessing the memory of host 28 directly, using the host's own address space. Techniques of this sort are disclosed in U.S. patent application Ser. No. 17/189,303, entitled “Cross Address-Space Bridging,” filed Mar. 2, 2021, whose disclosure is incorporated herein by reference. In an embodiment of this sort, processing circuitry 52 of DPU 24 creates an RDMA MKEY that describes a memory of host 28. In this manner, an RDMA operation can be performed directly between the memory of host 28 and the memory of the storage system (a network entity), eliminating the need for an extra copy to the DPU memory.
In alternative embodiments, processing circuitry 52 may perform zero-copy data transfer in any other suitable way.
In various embodiments, DPU 24 comprises various hardware or hardware-software mechanisms that enhance the flexibility of receiving and handling I/O transactions issued by host 28, and also reduce latency. Such mechanisms may comprise, for example, queues and corresponding doorbells, hardware registers, interrupts and the like. Several examples are given below.
Typically, host CPU 40 issues an I/O transaction by posting one or more work requests on a queue that can be read by processing circuitry 52 of DPU 24. The host and DPU may interact via multiple queues in parallel, e.g., a queue per host core, per application, per thread, per QoS class, per user or per tenant. In order to reduce latency, in some embodiments the host and the DPU use a doorbell mechanism, in which host processor 40 (i) signals to processing circuitry 52 that one or more work requests have been posted, and (ii) indicates the queue from which the DPU should read the work requests. Typically, the doorbell triggers hardware 60 in processing circuitry 52 to read the work requests from the specified queue and pass the work requests to CPUs 56 for processing.
In an example embodiment, hardware 60 is configured to regard one or more addresses on the PCIe Base Address Register (BAR) as doorbells. The BAR is exposed to the host via dedicated PCIe device 62. In this embodiment, host CPU 40 issues a doorbell by writing to one of these addresses. Such a write triggers hardware 60 in the DPU, which in turn reads any pending work requests from the specified queue.
Various techniques can be used for specifying the identity of the queue to be read. In one embodiment, a single BAR address is assigned to serve as a doorbell, and the host writes the appropriate queue identifier to this address. In another embodiment, each queue is assigned a respective different BAR address; any write to one of these BAR addresses is interpreted by hardware 60 as a doorbell for the corresponding queue. The value written to the address can be interpreted as the producer index of the queue. Alternatively, any other suitable mechanism or convention can be used.
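The second convention, in which each queue owns its own BAR address and the written value is taken as the queue's producer index, can be modeled by the following sketch; the BAR offsets and queue layout are illustrative assumptions.

```python
# Illustrative model of the per-queue doorbell convention: each queue owns a
# BAR address, and the value written there is taken as the new producer index.
from collections import deque

class Queue:
    def __init__(self):
        self.entries = []          # work requests posted by the host
        self.consumer_index = 0    # how far the DPU has read

class DoorbellDispatcher:
    def __init__(self, bar_map):
        # bar_map: BAR offset -> Queue (one doorbell address per queue)
        self.bar_map = bar_map
        self.pending = deque()     # work requests handed to the DPU CPUs

    def on_bar_write(self, bar_offset, value):
        """Hardware path: triggered by a host write to a doorbell address."""
        queue = self.bar_map[bar_offset]
        producer_index = value
        # Read every work request the host posted since the last doorbell.
        while queue.consumer_index < producer_index:
            self.pending.append(queue.entries[queue.consumer_index])
            queue.consumer_index += 1

# Example: host posts two work requests on queue 0 and rings its doorbell.
q0 = Queue()
dispatcher = DoorbellDispatcher({0x1000: q0})
q0.entries += [{"op": "write", "path": "/a", "len": 4096},
               {"op": "read", "path": "/b", "len": 512}]
dispatcher.on_bar_write(0x1000, value=2)   # producer index = 2
assert len(dispatcher.pending) == 2
```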
In some embodiments, the DPU software, running on CPUs 56, is configured to issue a Message Signaled Interrupt (MSI or MSI-X) to the host upon completing an I/O transaction. The interrupt triggers host CPU 40, and therefore reduces latency.
In some embodiments, CPUs 56 (in DPU 24) and host processor 40 (in host 28) are configured to exchange information and/or report events to one another by writing and reading registers defined on the PCIe bus BAR. These registers are exposed to the host via dedicated PCIe device 62. In some cases CPUs 56 (in DPU 24) write to a register and host processor 40 (in host 28) reads the register. In other cases host processor 40 (in host 28) writes to a register and CPUs 56 (in DPU 24) read the register. In some embodiments more complex register mechanisms can be defined. For example, writing to one register can affect the meaning of a subsequent write to another register.
In some embodiments, in exposing dedicated PCIe device 62, processing circuitry 52 is configured to emulate a “hot-plug” indication to host 28. The hot-plug indication notifies the host that a storage device has connected to PCIe bus 36.
In some embodiments, as part of emulating the storage protocol to the host, processing circuitry 52 in DPU 24 is configured to emulate various FS services to host 28. Any suitable FS service can be emulated, such as, for example, directory services and statistics collection.
Additionally or alternatively, as part of emulating the storage protocol to the host, processing circuitry 52 in DPU 24 is configured to perform one or more storage virtualization and data manipulation operations. Storage virtualization operations that may be performed by DPU 24 comprise, for example, cryptographic operations such as encryption, decryption, signing and authentication, deduplication, mirroring, isolation for security, Quality of Service (QoS), directory service, locking, compression, Artificial Intelligence (AI) operations, among others. In some embodiments, such operations can be carried out, at least in part, by hardware 60 in processing circuitry 52, and/or accelerated by a Graphics Processing Unit (GPU) coupled to DPU 24.
Efficient Emulation of Multiple Filesystems
In some embodiments, the processing circuitry of the disclosed peripheral device (e.g., DPU) exposes multiple separate filesystems (FSs) to the (one or more) hosts it serves using the dedicated storage PCIe device. In some embodiments, the peripheral device supports multiple separate FSs efficiently by performing joint de-duplication and/or caching across the various FSs.
In these embodiments, the dedicated PCIe storage device (in the present example dedicated virtio-fs PCIe device 88), which is exposed by DPU 24 over PCIe bus 36, exposes multiple separate FSs to apps/guests 106.
Thus, each app/guest 106 is able to send I/O transactions to the filesystem it is configured to use, using the appropriate bus storage protocol. The processing circuitry of DPU 24 runs a virtio-fs emulation module 110 that, among other emulation tasks, translates between the I/O transactions of the bus storage protocol and I/O transactions of the corresponding network storage protocol. As in the case of a single FS discussed above, DPU 24 communicates with each app/guest 106 using the bus storage protocol, and with the corresponding FS 32 using the network storage protocol. For this purpose, DPU 24 of
In practice, it is quite possible that different apps/guests store the same data in their respective FSs. A naïve implementation would be to disregard these commonalities, but this simplification may lead to degraded performance. A considerably more efficient solution is to perform de-duplication across multiple FSs 32. Another way of gaining storage efficiency is to perform caching across multiple FSs 32. In the embodiment of
In an example embodiment, module 114 identifies identical data items that are used by two or more of FSs 32, and de-duplicates this data. In the present context, the term “de-duplication” means storing fewer copies of the identified data (fewer than the number of filesystems 32 that use this data), e.g., only a single copy. Module 114 makes these fewer copies (e.g., single copy) available to the various FSs 32. Any data that is used by multiple filesystems can be de-duplicated in this manner, e.g., user data, user metadata such as data structures, and/or objects or other information of the filesystems themselves. Module 114 may de-duplicate data across FSs even when the FSs are of different types. Any suitable de-duplication scheme can be used for this purpose.
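One common way to realize such de-duplication is content-addressed storage keyed by a cryptographic digest of each block, as in the sketch below. This is an illustrative assumption about the mechanism, not a description of the disclosed implementation.

```python
# Illustrative sketch: content-addressed de-duplication shared by several FSs.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}    # digest -> data (single copy)
        self.refcount = {}  # digest -> number of (fs, path) references
        self.index = {}     # (fs_id, path) -> digest

    def write(self, fs_id: str, path: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = data            # store one copy only
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        self.index[(fs_id, path)] = digest

    def read(self, fs_id: str, path: str) -> bytes:
        return self.blocks[self.index[(fs_id, path)]]

store = DedupStore()
store.write("fs-tenant-a", "/img/base.qcow2", b"identical data")
store.write("fs-tenant-b", "/img/base.qcow2", b"identical data")
assert len(store.blocks) == 1          # one physical copy serves both FSs
assert store.read("fs-tenant-b", "/img/base.qcow2") == b"identical data"
```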
Additionally or alternatively, in some embodiments module 114 caches data for at least two of the separate FSs 32, in accordance with a caching policy that depends on the usage of the data across the multiple FSs. For example, the caching policy may give high priority in caching to Most Frequently Used (MFU) or Most Recently Used (MRU) data items, and/or evict from the cache Least Frequently Used (LFU) or Least Recently Used (LRU) data items. When evaluating such criteria, module 114 calculates the usage frequency, and/or records the usage times, across the multiple FSs rather than separately for each FS. Alternatively, any other suitable caching policy can be used.
In addition, the same cache (e.g., single-level or multi-level cache memory) is used for jointly serving the multiple FSs. Joint caching of this sort makes a considerably more efficient use of the available caching resources, relative to independent caching per individual FS.
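A compact way to express such joint caching is a single LRU cache whose keys span all filesystems, so that recency is tracked globally rather than per FS. The sketch below is illustrative only; a real policy could equally be LFU/MFU-based as described above.

```python
# Illustrative sketch: one LRU cache shared across all emulated filesystems,
# so usage (recency) is accounted globally rather than per FS.
from collections import OrderedDict

class SharedLruCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()   # (fs_id, key) -> data

    def get(self, fs_id, key):
        item = self.entries.get((fs_id, key))
        if item is not None:
            self.entries.move_to_end((fs_id, key))   # mark as most recently used
        return item

    def put(self, fs_id, key, data):
        self.entries[(fs_id, key)] = data
        self.entries.move_to_end((fs_id, key))
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)          # evict the global LRU victim

cache = SharedLruCache(capacity=2)
cache.put("fs1", "blockA", b"a")
cache.put("fs2", "blockB", b"b")
cache.get("fs1", "blockA")            # fs1/blockA becomes most recently used
cache.put("fs2", "blockC", b"c")      # evicts fs2/blockB, the global LRU entry
assert cache.get("fs2", "blockB") is None
assert cache.get("fs1", "blockA") == b"a"
```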
In some embodiments, the processing circuitry of the disclosed peripheral device (e.g., DPU) exposes a certain FS to the (one or more) hosts it serves using the dedicated storage PCIe device. Actual storage of the data, however, is performed by the DPU in multiple storage tiers. The multi-tier storage is typically transparent to the apps/guests that store the data.
An example four-tier storage scheme may use the following tiers, ordered from fastest to slowest:
In the example of
In addition to emulation module 110, the processing circuitry of DPU 24 runs a tiered-FS driver 122 that carries out the tiered storage in the various storage tiers. Among other tasks, driver 122 tracks the usage of data items by the (one or more) hosts, chooses the appropriate storage tier for each data item based on the tracked usage, and moves data from tier to tier accordingly. Typically, driver 122 will store frequently-accessed data (“hot data”) in storage tiers that are closer to the host, and infrequently-accessed data (“cold data”) in storage tiers that are farther from the host. Since data usage patterns may change over time, driver 122 may adapt to these changes by moving data from one tier to another. Apps/guests 106 are typically unaware of the tiered structure and of the actual storage locations of the various data items.
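The placement logic of such a tiered-FS driver can be sketched as follows; the tier names and access-count thresholds are arbitrary assumptions used only for illustration.

```python
# Illustrative sketch of tier placement driven by tracked usage.
# Tier names and thresholds are arbitrary; a real driver would use whichever
# tiers are available to the DPU (ordered fastest -> slowest for reference).
TIERS = ["dpu-memory", "local-nvme", "remote-fast", "remote-capacity"]

class TieredPlacement:
    def __init__(self, hot_threshold=100, warm_threshold=10):
        self.access_count = {}
        self.location = {}
        self.hot_threshold = hot_threshold
        self.warm_threshold = warm_threshold

    def _choose_tier(self, count):
        if count >= self.hot_threshold:
            return "dpu-memory"
        if count >= self.warm_threshold:
            return "local-nvme"
        return "remote-capacity"

    def on_access(self, item):
        """Track an access and migrate the item if its tier should change."""
        self.access_count[item] = self.access_count.get(item, 0) + 1
        target = self._choose_tier(self.access_count[item])
        if self.location.get(item) != target:
            self.location[item] = target   # in a real driver: copy data, then switch
        return self.location[item]

driver = TieredPlacement(hot_threshold=3, warm_threshold=2)
assert driver.on_access("/logs/app.log") == "remote-capacity"
assert driver.on_access("/logs/app.log") == "local-nvme"
assert driver.on_access("/logs/app.log") == "dpu-memory"
```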
In some practical cases, a host may request to transfer data (e.g., a file) that is stored in a filesystem to a remote host. One example is the Linux ‘sendfile’ command. A naive implementation of sendfile would be to read the file from the FS to the host, and then write the file from the host, over the network, to the requested remote host. In some embodiments, the disclosed peripheral device (e.g., DPU) carries out sendfile commands on behalf of the host, thereby offloading the host. In addition to reducing the processing load on the host, the disclosed technique also reduces the amount of data transfer to and from the host.
To perform storage emulation, the processing circuitry of DPU 24 comprises a virtio-fs emulation module 110, and exposes to the host a dedicated virtio-fs device 88. To perform network emulation, the processing circuitry of DPU 24 comprises a virtio-transport emulation module 142, and exposes to the host a dedicated virtio-net device 134.
Virtio-fs emulation module 110 communicates with a networked FS 32 as described above. Virtio-transport emulation module 142 is connected to the Internet 146, for communicating (among other network emulation tasks) with remote hosts that are destinations of sendfile commands. In addition, the processing circuitry of DPU 24 comprises a sendfile acceleration module 138 that performs sendfile acceleration and offloading.
In an example embodiment, the process of transferring a file in an offloaded manner begins with an app/guest 106 on host 28 sending a suitable request to virtio-fs device 88. This request is referred to herein as “offloaded sendfile” and is distinct from conventional sendfile commands. The “offloaded sendfile” request specifies (i) a file that was previously stored in FS 32, and (ii) a remote host to which the file is to be sent.
Virtio-fs device 88 transfers the request over PCIe bus 36 to sendfile acceleration module 138 in DPU 24. In response to the request, sendfile acceleration module 138 retrieves the file from FS 32 using virtio-fs emulation module 110, and sends the file to the remote host via virtio-transport emulation module 142. As seen, throughout this process, the file data does not traverse PCIe bus 36 and does not pass via host 28.
In some embodiments, sendfile acceleration module 138 in DPU 24 may delegate the execution of the “offloaded sendfile” to another DPU 24. For example, module 138 may instruct a peer DPU, which is closer to the remote host that is the final destination of the file, to transfer the file. In such embodiments, the peer DPU transfers the file directly to the remote host. The file data does not pass through the host that initiated the “offloaded sendfile”, nor does it pass through the DPU that serves the initiating host.
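The control flow of the offloaded transfer, including the optional delegation to a peer DPU, can be summarized by the sketch below; all class and method names are illustrative placeholders rather than actual interfaces of DPU 24.

```python
# Illustrative sketch of the "offloaded sendfile" control flow in the DPU.
# All classes are placeholders; the file data never traverses the host.

class FsEmulation:
    def fetch(self, path):
        return b"file contents fetched from the networked FS"

class TransportEmulation:
    def send(self, remote_host, data):
        print(f"sending {len(data)} bytes to {remote_host}")

class PeerDpuLink:
    def delegate_sendfile(self, peer, path, remote_host):
        print(f"asking peer DPU {peer} to send {path} to {remote_host}")

class SendfileAccelerator:
    def __init__(self, fs, transport, peers):
        self.fs, self.transport, self.peers = fs, transport, peers

    def offloaded_sendfile(self, path, remote_host, delegate_to=None):
        if delegate_to is not None:
            # Delegate: the peer DPU sends the data directly to the remote host;
            # the data passes through neither the initiating host nor this DPU.
            self.peers.delegate_sendfile(delegate_to, path, remote_host)
            return
        # Local offload: fetch from the FS and send, without involving the host.
        data = self.fs.fetch(path)
        self.transport.send(remote_host, data)

accel = SendfileAccelerator(FsEmulation(), TransportEmulation(), PeerDpuLink())
accel.offloaded_sendfile("/exports/big.iso", "10.0.0.7")
accel.offloaded_sendfile("/exports/big.iso", "10.0.0.7", delegate_to="dpu-rack2")
```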
The description above refers to transfer of files and to sendfile commands, by way of example. In alternative embodiments, the disclosed technique can be used for transferring any other suitable type of data.
It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
This application is a continuation-in-part of U.S. patent application Ser. No. 17/211,928, filed Mar. 25, 2021, whose disclosure is incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
5003465 | Chisholm et al. | Mar 1991 | A |
5463772 | Thompson | Oct 1995 | A |
5615404 | Knoll et al. | Mar 1997 | A |
5768612 | Nelson | Jun 1998 | A |
5864876 | Rossum et al. | Jan 1999 | A |
5893166 | Frank et al. | Apr 1999 | A |
5954802 | Griffith | Sep 1999 | A |
6070219 | McAlpine et al. | May 2000 | A |
6226680 | Boucher | May 2001 | B1 |
6321276 | Forin | Nov 2001 | B1 |
6581130 | Brinkmann et al. | Jun 2003 | B1 |
6701405 | Adusumilli et al. | Mar 2004 | B1 |
6766467 | Neal et al. | Jul 2004 | B1 |
6789143 | Craddock et al. | Sep 2004 | B2 |
6901496 | Mukund et al. | May 2005 | B1 |
6981027 | Gallo et al. | Dec 2005 | B1 |
7171484 | Krause et al. | Jan 2007 | B1 |
7225277 | Johns et al. | May 2007 | B2 |
7263103 | Kagan et al. | Aug 2007 | B2 |
7299266 | Boyd et al. | Nov 2007 | B2 |
7395364 | Higuchi et al. | Jul 2008 | B2 |
7464198 | Martinez et al. | Dec 2008 | B2 |
7475398 | Nunoe | Jan 2009 | B2 |
7548999 | Haertel et al. | Jun 2009 | B2 |
7577773 | Gandhi et al. | Aug 2009 | B1 |
7657659 | Lambeth et al. | Feb 2010 | B1 |
7720064 | Rohde | May 2010 | B1 |
7752417 | Manczak et al. | Jul 2010 | B2 |
7809923 | Hummel et al. | Oct 2010 | B2 |
7921178 | Haviv | Apr 2011 | B2 |
7921237 | Holland et al. | Apr 2011 | B1 |
7945752 | Miller et al. | May 2011 | B1 |
8001592 | Hatakeyama | Aug 2011 | B2 |
8006297 | Johnson et al. | Aug 2011 | B2 |
8010763 | Armstrong et al. | Aug 2011 | B2 |
8051212 | Kagan et al. | Nov 2011 | B2 |
8103785 | Crowley et al. | Jan 2012 | B2 |
8255475 | Kagan et al. | Aug 2012 | B2 |
8260980 | Weber et al. | Sep 2012 | B2 |
8346919 | Eiriksson | Jan 2013 | B1 |
8447904 | Riddoch | May 2013 | B2 |
8504780 | Mine et al. | Aug 2013 | B2 |
8645663 | Kagan et al. | Feb 2014 | B2 |
8745276 | Bloch et al. | Jun 2014 | B2 |
8751701 | Shahar et al. | Jun 2014 | B2 |
8824492 | Wang et al. | Sep 2014 | B2 |
8949486 | Kagan et al. | Feb 2015 | B1 |
9038073 | Kohlenz et al. | May 2015 | B2 |
9092426 | Bathija et al. | Jul 2015 | B1 |
9298723 | Vincent | Mar 2016 | B1 |
9678818 | Raikin et al. | Jun 2017 | B2 |
9696942 | Kagan et al. | Jul 2017 | B2 |
9727503 | Kagan et al. | Aug 2017 | B2 |
9830082 | Srinivasan et al. | Nov 2017 | B1 |
9904568 | Vincent et al. | Feb 2018 | B2 |
10078613 | Ramey | Sep 2018 | B1 |
10120832 | Raindel et al. | Nov 2018 | B2 |
10135739 | Raindel et al. | Nov 2018 | B2 |
10152441 | Liss et al. | Dec 2018 | B2 |
10162793 | Bshara et al. | Dec 2018 | B1 |
10210125 | Burstein | Feb 2019 | B2 |
10218645 | Raindel et al. | Feb 2019 | B2 |
10423774 | Zelenov et al. | Apr 2019 | B1 |
10382350 | Bohrer et al. | Aug 2019 | B2 |
10657077 | Ganor et al. | May 2020 | B2 |
10671309 | Glynn | Jun 2020 | B1 |
10684973 | Connor et al. | Jun 2020 | B2 |
10715451 | Raindel et al. | Jul 2020 | B2 |
10824469 | Hirshberg et al. | Nov 2020 | B2 |
10841243 | Levi et al. | Nov 2020 | B2 |
10999364 | Itigin et al. | May 2021 | B1 |
11003607 | Ganor et al. | May 2021 | B2 |
11086713 | Sapuntzakis et al. | Aug 2021 | B1 |
20020152327 | Kagan et al. | Oct 2002 | A1 |
20030023846 | Krishna et al. | Jan 2003 | A1 |
20030046530 | Poznanovic | Mar 2003 | A1 |
20030120836 | Gordon | Jun 2003 | A1 |
20040010612 | Pandya | Jan 2004 | A1 |
20040039940 | Cox et al. | Feb 2004 | A1 |
20040057434 | Poon et al. | Mar 2004 | A1 |
20040158710 | Buer et al. | Aug 2004 | A1 |
20040221128 | Beecroft et al. | Nov 2004 | A1 |
20040230979 | Beecroft et al. | Nov 2004 | A1 |
20050102497 | Buer | May 2005 | A1 |
20050198412 | Pedersen et al. | Sep 2005 | A1 |
20050216552 | Fineberg et al. | Sep 2005 | A1 |
20060095754 | Hyder et al. | May 2006 | A1 |
20060104308 | Pinkerton et al. | May 2006 | A1 |
20060259291 | Dunham et al. | Nov 2006 | A1 |
20060259661 | Feng et al. | Nov 2006 | A1 |
20070011429 | Sangili et al. | Jan 2007 | A1 |
20070061492 | Van Riel | Mar 2007 | A1 |
20070223472 | Tachibana et al. | Sep 2007 | A1 |
20070226450 | Engbersen et al. | Sep 2007 | A1 |
20070283124 | Menczak et al. | Dec 2007 | A1 |
20070297453 | Niinomi | Dec 2007 | A1 |
20080005387 | Mutaguchi | Jan 2008 | A1 |
20080147822 | Benhase et al. | Jun 2008 | A1 |
20080147904 | Freimuth et al. | Jun 2008 | A1 |
20080168479 | Purtell et al. | Jul 2008 | A1 |
20080313364 | Flynn et al. | Dec 2008 | A1 |
20090086736 | Foong et al. | Apr 2009 | A1 |
20090106771 | Benner et al. | Apr 2009 | A1 |
20090204650 | Wong | Aug 2009 | A1 |
20090319775 | Buer et al. | Dec 2009 | A1 |
20090328170 | Williams et al. | Dec 2009 | A1 |
20100030975 | Murray et al. | Feb 2010 | A1 |
20100095053 | Bruce et al. | Apr 2010 | A1 |
20100095085 | Hummel et al. | Apr 2010 | A1 |
20100211834 | Asnaashari et al. | Aug 2010 | A1 |
20100217916 | Gao et al. | Aug 2010 | A1 |
20100228962 | Simon et al. | Sep 2010 | A1 |
20100322265 | Gopinath et al. | Dec 2010 | A1 |
20110023027 | Kegel et al. | Jan 2011 | A1 |
20110119673 | Bloch et al. | May 2011 | A1 |
20110213854 | Haviv | Sep 2011 | A1 |
20110246597 | Swanson et al. | Oct 2011 | A1 |
20120314709 | Post et al. | Dec 2012 | A1 |
20130067193 | Kagan et al. | Mar 2013 | A1 |
20130080651 | Pope et al. | Mar 2013 | A1 |
20130103777 | Kagan et al. | Apr 2013 | A1 |
20130125125 | Karino et al. | May 2013 | A1 |
20130142205 | Munoz | Jun 2013 | A1 |
20130145035 | Pope et al. | Jun 2013 | A1 |
20130159568 | Shahar et al. | Jun 2013 | A1 |
20130263247 | Jungck et al. | Oct 2013 | A1 |
20130276133 | Hodges et al. | Oct 2013 | A1 |
20130311746 | Raindel et al. | Nov 2013 | A1 |
20130325998 | Hormuth et al. | Dec 2013 | A1 |
20130329557 | Petry | Dec 2013 | A1 |
20130347110 | Dalal | Dec 2013 | A1 |
20140089450 | Raindel et al. | Mar 2014 | A1 |
20140089451 | Eran et al. | Mar 2014 | A1 |
20140089631 | King | Mar 2014 | A1 |
20140122828 | Kagan et al. | May 2014 | A1 |
20140129741 | Shahar et al. | May 2014 | A1 |
20140156894 | Tsirkin et al. | Jun 2014 | A1 |
20140181365 | Fanning et al. | Jun 2014 | A1 |
20140185616 | Bloch et al. | Jul 2014 | A1 |
20140254593 | Mital et al. | Sep 2014 | A1 |
20140282050 | Quinn et al. | Sep 2014 | A1 |
20140282561 | Holt et al. | Sep 2014 | A1 |
20150006663 | Huang | Jan 2015 | A1 |
20150012735 | Tamir et al. | Jan 2015 | A1 |
20150032835 | Sharp et al. | Jan 2015 | A1 |
20150081947 | Vucinic et al. | Mar 2015 | A1 |
20150100962 | Morita et al. | Apr 2015 | A1 |
20150288624 | Raindel et al. | Oct 2015 | A1 |
20150319243 | Hussain et al. | Nov 2015 | A1 |
20150347185 | Holt et al. | Dec 2015 | A1 |
20150355938 | Jokinen et al. | Dec 2015 | A1 |
20160065659 | Bloch et al. | Mar 2016 | A1 |
20160085718 | Huang | Mar 2016 | A1 |
20160132329 | Gupte et al. | May 2016 | A1 |
20160226822 | Zhang et al. | Aug 2016 | A1 |
20160342547 | Liss et al. | Nov 2016 | A1 |
20160350151 | Zou et al. | Dec 2016 | A1 |
20160378529 | Wen | Dec 2016 | A1 |
20170075855 | Sajeepa et al. | Mar 2017 | A1 |
20170104828 | Brown et al. | Apr 2017 | A1 |
20170180273 | Daly et al. | Jun 2017 | A1 |
20170187629 | Shalev et al. | Jun 2017 | A1 |
20170237672 | Dalal | Aug 2017 | A1 |
20170264622 | Cooper et al. | Sep 2017 | A1 |
20170286157 | Hasting et al. | Oct 2017 | A1 |
20170371835 | Ranadive et al. | Dec 2017 | A1 |
20180004954 | Liguori et al. | Jan 2018 | A1 |
20180067893 | Raindel et al. | Mar 2018 | A1 |
20180109471 | Chang et al. | Apr 2018 | A1 |
20180114013 | Sood et al. | Apr 2018 | A1 |
20180167364 | Dong et al. | Jun 2018 | A1 |
20180210751 | Pepus et al. | Jul 2018 | A1 |
20180219770 | Wu et al. | Aug 2018 | A1 |
20180219772 | Koster et al. | Aug 2018 | A1 |
20180246768 | Palermo et al. | Aug 2018 | A1 |
20180262468 | Kumar et al. | Sep 2018 | A1 |
20180285288 | Bernat et al. | Oct 2018 | A1 |
20180329828 | Apfelbaum et al. | Nov 2018 | A1 |
20190012350 | Sindhu et al. | Jan 2019 | A1 |
20190026157 | Suzuki et al. | Jan 2019 | A1 |
20190116127 | Pismenny et al. | Apr 2019 | A1 |
20190163364 | Gibb et al. | May 2019 | A1 |
20190173846 | Patterson et al. | Jun 2019 | A1 |
20190190892 | Menachem et al. | Jun 2019 | A1 |
20190199690 | Klein | Jun 2019 | A1 |
20190243781 | Thyamagondlu et al. | Aug 2019 | A1 |
20190250938 | Claes et al. | Aug 2019 | A1 |
20200012604 | Agarwal | Jan 2020 | A1 |
20200026656 | Liao et al. | Jan 2020 | A1 |
20200065269 | Balasubramani et al. | Feb 2020 | A1 |
20200259803 | Menachem et al. | Aug 2020 | A1 |
20200314181 | Eran et al. | Oct 2020 | A1 |
20200401440 | Sankaran et al. | Dec 2020 | A1 |
20210111996 | Pismenny et al. | Apr 2021 | A1 |
20210203610 | Pismenny et al. | Jul 2021 | A1 |
20220075747 | Shuler et al. | Mar 2022 | A1 |
20220100687 | Sahin et al. | Mar 2022 | A1 |
20220103629 | Cherian et al. | Mar 2022 | A1 |
Number | Date | Country |
---|---|---|
1657878 | May 2006 | EP |
2463782 | Jun 2012 | EP |
2010062679 | Jun 2010 | WO |
Entry |
---|
“Linux kernel enable the IOMMU—input/output memory management unit support”, pp. 1-2, Oct. 15, 2007 downloaded from http://www.cyberciti.biz/tips/howto-turn-on-linux-software-iommu-support.html. |
Hummel M., “IO Memory Management Hardware Goes Mainstream”, AMD Fellow, Computation Products Group, Microsoft WinHEC, pp. 1-7, 2006. |
PCI Express, Base Specification, Revision 3.0, pp. 1-860, Nov. 10, 2010. |
NVM Express, Revision 1.0e, pp. 1-127, Jan. 23, 2014. |
Infiniband Trade Association, “InfiniBandTM Architecture Specification”, vol. 1, Release 1.2.1, pp. 1-1727, Nov. 2007. |
Shah et al., “Direct Data Placement over Reliable Transports”, IETF Network Working Group, RFC 5041, pp. 1-38, Oct. 2007. |
Culley et al., “Marker PDU Aligned Framing for TCP Specification”, IETF Network Working Group, RFC 5044, pp. 1-75, Oct. 2007. |
“MPI: A Message-Passing Interface Standard”, Version 2.2, Message Passing Interface Forum, pp. 1-64, Sep. 4, 2009. |
Welsh et al., “Incorporating Memory Management into User-Level Network Interfaces”, Department of Computer Science, Cornell University, Technical Report TR97-1620, pp. 1-10, Feb. 13, 1997. |
Tsirkin et al., “Virtual I/O Device (Virtio) Version 1.1”, Committee Specification 01, OASIS, Section 5.11, pp. 156-160, Apr. 11, 2019. |
Burstein et al., U.S. Appl. No. 17/189,303, filed Mar. 2, 2021. |
U.S. Appl. No. 17/372,466 Office Action dated Feb. 15, 2023. |
Shirey, “Internet Security Glossary, Version 2”, Request for Comments 4949, pp. 1-365, Aug. 2007. |
Information Sciences Institute, “Transmission Control Protocol; DARPA Internet Program Protocol Specification”, Request for Comments 793, pp. 1-90, Sep. 1981. |
InfiniBand TM Architecture Specification vol. 1, Release 1.3, pp. 1-1842, Mar. 3, 2015. |
Stevens., “TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms”, Request for Comments 2001, pp. 1-6, Jan. 1997. |
Netronome Systems, Inc., “Open vSwitch Offload and Acceleration with Agilio® CX SmartNICs”, White Paper, pp. 1-7, Mar. 2017. |
Dierks et al., “The Transport Layer Security (TLS) Protocol Version 1.2”, Request for Comments: 5246, pp. 1-104, Aug. 2008. |
Turner et al., “Prohibiting Secure Sockets Layer (SSL) Version 2.0”, Request for Comments: 6176, pp. 1-4, Mar. 2011. |
Rescorla et al., “The Transport Layer Security (TLS) Protocol Version 1.3”, Request for Comments: 8446, pp. 1-160, Aug. 2018. |
Comer., “Packet Classification: A Faster, More General Alternative to Demultiplexing”, The Internet Protocol Journal, vol. 15, No. 4, pp. 12-22, Dec. 2012. |
Salowey et al., “AES Galois Counter Mode (GCM) Cipher Suites for TLS”, Request for Comments: 5288, pp. 1-8, Aug. 2008. |
Burstein, “Enabling Remote Persistent Memory”, SNIA—PM Summit, pp. 1-24, Jan. 24, 2019. |
Chung et al., “Serving DNNs in Real Time at Datacenter Scale with Project Brainwave”, IEEE Micro Pre-Print, pp. 1-11, Mar. 22, 2018. |
Talpey, “Remote Persistent Memory—With Nothing But Net”, SNIA Storage Developer Conference, pp. 1-30, year 2017. |
Microsoft, “Project Brainwave”, pp. 1-5, year 2019. |
NVM Express Inc., “NVM ExpressTM Base Specification”, Revision 1.4, p. 1-403, Jun. 10, 2019. |
Pismenny et al., “Autonomous NIC Offloads”, submitted for evaluation of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), p. 1-18, Dec. 13, 2020. |
Lebeane et al., “Extended Task queuing: Active Messages for Heterogeneous Systems”, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), pp. 933-944, Nov. 2016. |
NVM Express Inc., “NVM Express over Fabrics,” Revision 1.0, pp. 1-49, Jun. 5, 2016. |
Microchip Technology Incorporated, “Switchtec PAX Gen 4 Advanced Fabric PCIe Switch Family—PM42100, PM42068, PM42052, PM42036, PM42028,” Product Brochure, pp. 1-2, year 2021. |
Regula, “Using Non-Transparent Bridging in PCI Express Systems,” PLX Technology, Inc., pp. 1-31, Jun. 2004. |
U.S. Appl. No. 17/372,466 Office Action dated Nov. 2, 2022. |
U.S. Appl. No. 17/211,928 Office Action dated May 25, 2023. |
Mellanox Technologies, “Understanding On Demand Paging (ODP),” Knowledge Article, pp. 1-6, Feb. 20, 2019 downloaded from https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x. |
Rosenbaum et al., U.S. Appl. No. 17/338,131, filed Jun. 3, 2021. |
Duer et al., U.S. Appl. No. 17/211,928, filed Mar. 25, 2021. |
Ben-Ishay et al., U.S. Appl. No. 17/372,466, filed Jul. 11, 2021. |
Bar-Ilan et al, U.S. Appl. No. 17/234,189, filed Apr. 19, 2021. |
U.S. Appl. No. 17/979,013 Office Action dated Jan. 29, 2024. |
Number | Date | Country | |
---|---|---|
20220308764 A1 | Sep 2022 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17211928 | Mar 2021 | US |
Child | 17527197 | | US |