Methods and systems for storing data in a distributed system using GPUS

Information

  • Patent Grant
  • 12131074
  • Patent Number
    12,131,074
  • Date Filed
    Wednesday, October 27, 2021
  • Date Issued
    Tuesday, October 29, 2024
Abstract
In general, embodiments relate to a method for storing data, the method comprising: generating, by a memory hypervisor module executing on a client application node, at least one input/output (I/O) request, wherein the at least one I/O request specifies a location in a storage pool and a physical address of the data in a graphics processing unit (GPU) memory in a GPU on the client application node, wherein the location is determined using a data layout, and wherein the physical address is determined using a GPU module; and issuing, by the memory hypervisor module, the at least one I/O request to the storage pool, wherein processing the at least one I/O request results in at least a portion of the data being stored at the location.
Description
BACKGROUND

Applications generate and/or manipulate large amounts of data. Thus, the performance of these applications is typically impacted by the manner in which the applications may read and/or write data.





BRIEF DESCRIPTION OF DRAWINGS


FIG. 1A shows a diagram of a system in accordance with one or more embodiments of the invention.



FIG. 1B shows a diagram of compute and storage infrastructure (CSI) in accordance with one or more embodiments of the invention.



FIG. 2A shows a diagram of a client application node in accordance with one or more embodiments of the invention.



FIG. 2B shows a diagram of a client file system (FS) container in accordance with one or more embodiments of the invention.



FIG. 3 shows an example of a metadata node in accordance with one or more embodiments of the invention.



FIG. 4 shows an example of a storage node in accordance with one or more embodiments of the invention.



FIG. 5A shows relationships between various virtual elements in the system in accordance with one or more embodiments of the invention.



FIG. 5B shows relationships between various virtual and physical elements in the system in accordance with one or more embodiments of the invention.



FIG. 6 shows a flowchart of a method of generating and servicing a mapping request in accordance with one or more embodiments of the invention.



FIG. 7 shows a flowchart of a method of servicing a write request in accordance with one or more embodiments of the invention.





DETAILED DESCRIPTION

Specific embodiments will now be described with reference to the accompanying figures. In the following description, numerous details are set forth as examples of the invention. One of ordinary skill in the art, having the benefit of this detailed description, would appreciate that one or more embodiments of the present invention may be practiced without these specific details and that numerous variations or modifications may be possible without departing from the scope of the invention. Certain details known to those of ordinary skill in the art may be omitted to avoid obscuring the description.


In the following description of the figures, any component described with regard to a figure, in various embodiments of the invention, may be equivalent to one or more like-named components shown and/or described with regard to any other figure. For brevity, descriptions of these components may not be repeated with regard to each figure. Thus, each and every embodiment of the components of each figure is incorporated by reference and assumed to be optionally present within every other figure having one or more like-named components. Additionally, in accordance with various embodiments of the invention, any description of any component of a figure is to be interpreted as an optional embodiment, which may be implemented in addition to, in conjunction with, or in place of the embodiments described with regard to a corresponding like-named component in any other figure.


Throughout the application, ordinal numbers (e.g., first, second, third, etc.) may be used as an adjective for an element (i.e., any noun in the application). The use of ordinal numbers is not to imply or create any particular ordering of the elements nor to limit any element to being only a single element unless expressly disclosed, such as by the use of the terms “before”, “after”, “single”, and other such terminology. Rather, the use of ordinal numbers is to distinguish between the elements. By way of an example, a first element is distinct from a second element, and the first element may encompass more than one element and succeed (or precede) the second element in an ordering of elements.


As used herein, the term ‘operatively connected’, or ‘operative connection’, means that there exists between elements/components/devices a direct or indirect connection that allows the elements to interact with one another in some way (e.g., via the exchange of information). For example, the phrase ‘operatively connected’ may refer to any direct (e.g., wired or wireless connection directly between two devices) or indirect (e.g., wired and/or wireless connections between any number of devices connecting the operatively connected devices) connection.


In general, embodiments of the invention relate to systems, devices, and methods for implementing and leveraging memory devices (e.g., persistent memory (defined below) and NVMe devices (defined below)) to improve performance of data requests (e.g., read and write requests). More specifically, various embodiments of the invention enable applications (e.g., applications in the application container in FIG. 2A) to issue data requests (e.g., requests to read and write data) to the operating system (OS). The OS receives such requests and processes them using an implementation of the portable operating system interface (POSIX). The client FS container may receive such requests via POSIX and subsequently process such requests. The processing of these requests includes interacting with metadata nodes (see e.g., FIG. 3) to obtain data layouts that provide a mapping between file offsets and scale out volume offsets (SOVs) (see e.g., FIGS. 5A-5B). Using the SOVs, the memory hypervisor module in the client FS container (see e.g., FIG. 2B) issues input/output (I/O) requests, via a fabric (also referred to as a communication fabric, described below), directly to the locations in the storage pool (110) (see e.g., FIG. 5B), bypassing the storage stack on the metadata nodes. Once the requested I/O is performed on the storage pool, a response is provided, via POSIX, to the application.
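
By way of a non-limiting illustration, the translation of a file byte range into scale out volume offsets using a data layout may be sketched as follows. The names Extent, DataLayout, and to_io_requests, as well as the representation of a data layout as a list of contiguous extents, are assumptions made solely for this example and are not taken from the embodiments described herein.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Extent:
    """One contiguous run of a file mapped onto a scale out volume (hypothetical)."""
    file_offset: int  # starting byte offset within the file
    sov_offset: int   # starting byte offset within the scale out volume
    length: int       # number of bytes covered by this extent


class DataLayout:
    """Hypothetical data layout: a list of extents sorted by file offset."""

    def __init__(self, extents: List[Extent]):
        self.extents = sorted(extents, key=lambda e: e.file_offset)

    def to_io_requests(self, file_offset: int, length: int) -> List[Tuple[int, int]]:
        """Split a byte range of the file into (sov_offset, length) pieces."""
        requests = []
        end = file_offset + length
        for ext in self.extents:
            # Intersect the requested range with this extent.
            start = max(file_offset, ext.file_offset)
            stop = min(end, ext.file_offset + ext.length)
            if start < stop:
                requests.append((ext.sov_offset + (start - ext.file_offset), stop - start))
        # Assumes non-overlapping extents; any gap means the layout is incomplete.
        if sum(n for _, n in requests) != length:
            raise ValueError("data layout does not cover the requested range")
        return requests


layout = DataLayout([Extent(0, 4096, 1024), Extent(1024, 65536, 1024)])
pieces = layout.to_io_requests(512, 1024)  # the range spans both extents
```

In this sketch, a request that crosses an extent boundary yields one (SOV offset, length) pair per extent, each of which could then be issued as a separate I/O request to the storage pool.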


Using the aforementioned architecture, embodiments of the invention enable applications to interact with the memory devices at scale in a manner that is transparent to the applications. Said another way, the OS may continue to interact with the client FS container using POSIX and the client FS container, in turn, will provide a transparent mechanism to translate the requests received via POSIX into I/O requests that may be directly serviced by the storage pool.



FIG. 1A shows a diagram of a system in accordance with one or more embodiments of the invention. The system includes one or more clients (100), operatively connected to a network (102), which is operatively connected to one or more node(s) (not shown) in a compute and storage infrastructure (CSI) (104). The components illustrated in FIG. 1A may be connected via any number of operable connections supported by any combination of wired and/or wireless networks (e.g., network (102)). Each component of the system of FIG. 1A is discussed below.


In one embodiment of the invention, the one or more clients (100) are configured to issue requests to the node(s) in the CSI (104) (or to a specific node of the node(s)), to receive responses, and to generally interact with the various components of the nodes (described below).


In one or more embodiments of the invention, one or more clients (100) are implemented as computing devices. Each computing device may include one or more processors, memory (e.g., random access memory), and persistent storage (e.g., disk drives, solid state drives, etc.). The persistent storage may store computer instructions, (e.g., computer code), that when executed by the processor(s) of the computing device cause the computing device to issue one or more requests and to receive one or more responses. Examples of a computing device include a mobile phone, tablet computer, laptop computer, desktop computer, server, distributed computing system, or cloud resource.


In one or more embodiments of the invention, the one or more clients (100) are implemented as a logical device. The logical device may utilize the computing resources of any number of computing devices and thereby provide the functionality of the one or more clients (100) described throughout this application.


In one or more embodiments of the invention, the one or more clients (100) may request data and/or send data to the node(s) in the CSI (104). Further, in one or more embodiments, the one or more clients (100) may initiate an application to execute on one or more client application nodes in the CSI (104) such that the application may, itself, gather, transmit, and/or otherwise manipulate data on the client application nodes, remote to the client(s). In one or more embodiments, one or more clients (100) may share access to the same one or more client application nodes in the CSI (104) and may similarly share any data located on those client application nodes in the CSI (104).


In one or more embodiments of the invention, network (102) of the system is a collection of connected network devices that allow for the communication of data from one network device to other network devices, or the sharing of resources among network devices. Examples of a network (e.g., network (102)) include, but are not limited to, a local area network (LAN), a wide area network (WAN) (e.g., the Internet), a mobile network, or any other type of network that allows for the communication of data and sharing of resources among network devices and/or devices (e.g., clients (100), node(s) in the CSI (104)) operatively connected to the network (102). In one embodiment of the invention, the one or more clients (100) are operatively connected to the node(s) (104) via a network (e.g., network (102)).


The CSI (104) includes one or more client application nodes, one or more metadata nodes, and zero, one, or more storage nodes. Additional detail about the architecture of the CSI is provided below in FIG. 1B. Further, various embodiments of the node(s) (104) are provided in FIGS. 2A-4 below.


While FIG. 1A shows a specific configuration of a system, other configurations may be used without departing from the scope of the disclosure. For example, although the one or more clients (100) and node(s) (104) are shown to be operatively connected through network (102), one or more clients (100) and node(s) (104) may be directly connected, without an intervening network (e.g., network (102)). Further, the functioning of the one or more clients (100) and the node(s) in the CSI (104) is not dependent upon the functioning and/or existence of the other device(s) (e.g., node(s) (104) and one or more clients (100), respectively). Rather, the one or more clients (100) and the node(s) in the CSI (104) may function independently and perform operations locally that do not require communication with other devices. Accordingly, embodiments disclosed herein should not be limited to the configuration of devices and/or components shown in FIG. 1A.



FIG. 1B shows a diagram of compute and storage infrastructure (CSI) in accordance with one or more embodiments of the invention. As discussed above, the client application node(s) (106) executes applications and interacts with the metadata node(s) (108) to obtain, e.g., data layouts and other information (as described below) to enable the client application nodes to directly issue I/O requests to memory devices (or other storage media), which may be located on the client application nodes, the metadata nodes and/or the storage nodes, while bypassing the storage stack (e.g., the metadata server and the file system) on the metadata nodes. To that end, the client application nodes are able to directly communicate over a communication fabric(s) using various communication protocols, e.g., using Non-Volatile Memory Express (NVMe) over Fabric (NVMe-oF) and/or persistent memory over Fabric (PMEMoF), with the storage media in the storage pool (110) (see e.g., FIG. 5B).



FIGS. 2A-2B show diagrams of a client application node (200) in accordance with one or more embodiments of the invention. In one embodiment of the invention, client application node (200) includes one or more application container(s) (e.g., application container (202)), a client FS container (206), an operating system (OS) (208), and a hardware layer (210). Each of these components is described below. In one or more embodiments of the invention, the client application node (200) (or one or more components therein) is configured to perform all, or a portion, of the functionality described in FIGS. 6-7.


In one or more embodiments of the invention, an application container (202) is software executing on the client application node. The application container (202) may be an independent software instance that executes within a larger container management software instance (not shown) (e.g., Docker®, Kubernetes®). In embodiments in which the application container (202) is executing as an isolated software instance, the application container (202) may establish a semi-isolated virtual environment, inside the container, in which to execute one or more applications (e.g., application (212)).


In one embodiment of the invention, an application container (202) may be executing in “user space” (e.g., a layer of the software that utilizes low-level system components for the execution of applications) of the OS (208) of the client application node (200).


In one or more embodiments of the invention, an application container (202) includes one or more applications (e.g., application (212)). An application (212) is software executing within the application container (e.g., 202), that may include instructions which, when executed by a processor(s) (not shown) (in the hardware layer (210)), initiate the performance of one or more operations of components of the hardware layer (210). Although applications (212) are shown executing within application containers (202) of FIG. 2A, one or more applications (e.g., 212) may execute outside of an application container (e.g., 202). That is, in one or more embodiments, one or more applications (e.g., 212) may execute in a non-isolated instance, at the same level as the application container (202) or client FS container (206).


In one or more embodiments of the invention, each application (212) includes a virtual address space (e.g., virtual address space (220)). In one embodiment of the invention, a virtual address space (220) is a simulated range of addresses (e.g., identifiable locations) that mimics the physical locations of one or more components of the hardware layer (210). In one embodiment, an application (212) is not configured to identify the physical addresses of one or more components of the hardware layer (210); rather, the application (212) relies on other components of the client application node (200) to translate one or more virtual addresses of the virtual address space (e.g., 220) to one or more physical addresses of one or more components of the hardware layer (210). Accordingly, in one or more embodiments of the invention, an application may utilize a virtual address space (220) to read, write, and/or otherwise manipulate data, without being configured to directly identify the physical address of that data within the components of the hardware layer (210).


Additionally, in one or more embodiments of the invention, an application may coordinate with other components of the client application node (200) to establish a mapping, see e.g., FIG. 6, between a virtual address space (e.g., 220) and underlying physical components of the hardware layer (210). In one embodiment, if a mapping is established, an application's use of the virtual address space (e.g., 220) enables the application to directly manipulate data in the hardware layer (210), without relying on other components of the client application node (200) to repeatedly update mappings between the virtual address space (e.g., 220) and the physical addresses of one or more components of the hardware layer (210). The above discussion with respect to the application's ability to interact with the hardware layer (210) is from the perspective of the application (212). However, as discussed below, the client FS container (206) (in conjunction with the metadata nodes) transparently enables the application to ultimately read and write (or otherwise manipulate) data that is remotely stored in the storage pool.


In one or more embodiments of the invention, a client FS container (206) is software executing on the client application node (200). A client FS container (206) may be an independent software instance that executes within a larger container management software instance (not shown) (e.g., Docker®, Kubernetes®, etc.). In embodiments in which the client FS container (206) is executing as an isolated software instance, the client FS container (206) may establish a semi-isolated virtual environment, inside the container, in which to execute an application (e.g., FS client (240) and memory hypervisor module (242), described below). In one embodiment of the invention, a client FS container (206) may be executing in “user space” (e.g., a layer of the software that utilizes low-level system components for the execution of applications) of the OS (208).


Referring to FIG. 2B, in one embodiment of the invention, the client FS container (206) includes an FS client (240) and a memory hypervisor module (242). In one embodiment, an FS client (240) is software executing within the client FS container (206). The FS client (240) is a local file system that includes functionality to interact with the OS using POSIX (i.e., using file semantics). Said another way, from the perspective of the OS, the FS client is the file system for the client application node and it is a POSIX file system. However, while the FS client interacts with the OS using POSIX, the FS client also includes functionality to interact with the metadata nodes and the memory hypervisor module using protocols other than POSIX (e.g., using memory semantics instead of file semantics).


In one or more embodiments of the invention, FS client (240) may include functionality to generate one or more virtual-to-physical address mappings by translating a virtual address of a virtual address space (220) to a physical address of a component in the hardware layer (210). Further, in one embodiment of the invention, the FS client (240) may further be configured to communicate one or more virtual-to-physical address mappings to one or more components of the hardware layer (210) (e.g., memory management unit (not shown)). In one embodiment of the invention, the FS client (240) tracks and maintains various mappings as described below in FIGS. 5A-5B. Additionally, in one or more embodiments of the invention, FS client (240) is configured to initiate the generation and issuance of I/O requests by the memory hypervisor module (242) (see e.g., FIGS. 6-7).


In one embodiment of the invention, the memory hypervisor module (242) is software executing within the client FS container (206) that includes functionality to generate and issue I/O requests over fabric directly to storage media in the storage pool. Additional detail about the operation of the memory hypervisor module is described below in FIGS. 6-7.


Returning to FIG. 2A, in one or more embodiments of the invention, an OS (208) is software executing on the client application node (200). In one embodiment of the invention, an OS (208) coordinates operations between software executing in “user space” (e.g., containers (202, 206), applications (212)) and one or more components of the hardware layer (210) to facilitate the proper use of those hardware layer (210) components. In one or more embodiments of the invention, the OS (208) includes a kernel module (230) and a GPU module (246). In one embodiment of the invention, the kernel module (230) is software executing in the OS (208) that monitors data (which may include read and write requests) traversing the OS (208) and may intercept, modify, and/or otherwise alter that data based on one or more conditions. In one embodiment of the invention, the kernel module (230) is capable of redirecting data received by the OS (208) by intercepting and modifying that data to specify a recipient different than normally specified by the OS (208).


In one embodiment of the invention, the GPU module (246) is software executing in the OS (208) that manages the mappings between the virtual address space and physical addresses in the GPU memory (not shown) that the GPU(s) (244) is using. Said another way, the application executing in the application container may be GPU-aware and, as such, store data directly within the GPU memory. The application may interact with the GPU memory using virtual addresses. The GPU module (246) maintains a mapping between the virtual addresses used by the application and the corresponding physical address of the data located in the GPU memory. In one embodiment of the invention, prior to the application using the GPU memory, the GPU module may register all or a portion of the GPU memory with the RDMA engine (which implements RDMA) within the external communication interface(s) (232). This registration allows the data stored within the registered portion of the GPU memory to be directly accessed by the RDMA engine and transferred to the storage nodes (see e.g., FIG. 2A). Because the GPU manages the memory allocated to the GPU, the client FS module needs to be able to interact with the GPU module in order for the memory hypervisor module to issue write requests to the storage nodes. In particular, while the memory hypervisor module obtains information related to the physical location(s) in the storage node(s) on which to write the data, it also needs the physical address(es) of the data in the GPU memory so that it can issue the appropriate write requests to store this data in the storage node(s). Without the physical addresses of the source data within the GPU memory, the memory hypervisor module does not have sufficient information to locate the data in the memory that is the subject of the write request.
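
By way of a non-limiting illustration, the interaction described above, in which a write request pairs a location in the storage pool with the physical address of the source data in the GPU memory, may be sketched as follows. The names GpuModule, IORequest, and build_write_requests, the page-table representation, and the 4096-byte page size are assumptions made solely for this example; an actual GPU module would obtain these translations from the GPU driver rather than from a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List

PAGE_SIZE = 4096  # assumed GPU page size, for illustration only


@dataclass(frozen=True)
class IORequest:
    storage_location: int      # location in the storage pool (from the data layout)
    gpu_physical_address: int  # source address of the data in GPU memory
    length: int                # number of bytes to transfer


class GpuModule:
    """Hypothetical stand-in for the GPU module's virtual-to-physical mapping."""

    def __init__(self, page_table: Dict[int, int]):
        self.page_table = page_table  # virtual page number -> physical page number

    def translate(self, virtual_address: int) -> int:
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn not in self.page_table:
            raise KeyError(f"virtual address {virtual_address:#x} is not mapped")
        return self.page_table[vpn] * PAGE_SIZE + offset


def build_write_requests(gpu: GpuModule, virtual_address: int, length: int,
                         storage_location: int) -> List[IORequest]:
    """Emit one request per GPU page touched, because virtually adjacent
    pages are not guaranteed to be physically adjacent."""
    requests = []
    remaining = length
    while remaining > 0:
        # Never let a single request cross a page boundary.
        chunk = min(remaining, PAGE_SIZE - virtual_address % PAGE_SIZE)
        requests.append(IORequest(storage_location, gpu.translate(virtual_address), chunk))
        virtual_address += chunk
        storage_location += chunk
        remaining -= chunk
    return requests


gpu = GpuModule({0: 7, 1: 3})
reqs = build_write_requests(gpu, virtual_address=4000, length=200,
                            storage_location=1_000_000)
```

In this sketch, a 200-byte buffer that straddles two virtual pages produces two requests whose GPU physical addresses are unrelated, which illustrates why the virtual address alone would be insufficient for locating the source data.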


In one or more embodiments of the invention, the hardware layer (210) is a collection of physical components configured to perform the operations of the client application node (200) and/or otherwise execute the software of the client application node (200) (e.g., those of the containers (202, 206), applications (e.g., 212)).


In one embodiment of the invention, the hardware layer (210) includes one or more communication interface(s) (232). In one embodiment of the invention, a communication interface (232) is a hardware component that provides capabilities to interface the client application node (200) with one or more devices (e.g., a client, another node in the CSI (104), etc.) and allow for the transmission and receipt of data (including metadata) with those device(s). A communication interface (232) may communicate via any suitable form of wired interface (e.g., Ethernet, fiber optic, serial communication, etc.) and/or wireless interface and utilize one or more protocols for the transmission and receipt of data (e.g., Transmission Control Protocol (TCP)/Internet Protocol (IP), Remote Direct Memory Access, IEEE 802.11, etc.).


In one embodiment of the invention, the communication interface (232) may implement and/or support one or more protocols to enable the communication between the client application nodes and external entities (e.g., other nodes in the CSI, one or more clients, etc.). For example, the communication interface (232) may enable the client application node to be operatively connected, via Ethernet, using a TCP/IP protocol to form a “network fabric” and enable the communication of data between the client application node and other external entities. In one or more embodiments of the invention, each node within the CSI may be given a unique identifier (e.g., an IP address) to be used when utilizing one or more protocols.


Further, in one embodiment of the invention, the communication interface (232), when using a certain protocol or variant thereof, supports streamlined access to storage media of other nodes in the CSI. For example, when utilizing remote direct memory access (RDMA) to access data on another node in the CSI, it may not be necessary to interact with the software (or storage stack) of that other node in the CSI. Rather, when using RDMA (via an RDMA engine (not shown) in the communication interface(s) (232)), it may be possible for the client application node to interact only with the hardware elements of the other node to retrieve and/or transmit data, thereby avoiding any higher-level processing by the software executing on that other node. In other embodiments of the invention, the communication interface enables direct communication with the storage media of other nodes using Non-Volatile Memory Express (NVMe) over Fabric (NVMe-oF) and/or persistent memory over Fabric (PMEMoF) (both of which may (or may not) utilize all or a portion of the functionality provided by RDMA).


In one embodiment of the invention, the hardware layer (210) includes one or more processor(s) (not shown). In one embodiment of the invention, a processor may be an integrated circuit(s) for processing instructions (e.g., those of the containers (202, 206), applications (e.g., 212) and/or those received via a communication interface (232)). In one embodiment of the invention, processor(s) may be one or more processor cores or processor micro-cores. Further, in one or more embodiments of the invention, one or more processor(s) may include a cache (not shown) (as described).


In one or more embodiments of the invention, the hardware layer (210) includes persistent storage (236). In one embodiment of the invention, persistent storage (236) may be one or more hardware devices capable of storing digital information (e.g., data) in a non-transitory medium. Further, in one embodiment of the invention, when accessing persistent storage (236), other components of client application node (200) are capable of only reading and writing data in fixed-length data segments (e.g., “blocks”) that are larger than the smallest units of data normally accessible (e.g., “bytes”).


Specifically, in one or more embodiments of the invention, when data is read from persistent storage (236), all blocks that include the requested bytes of data (some of which may include other, non-requested bytes of data) must be copied to other byte-accessible storage (e.g., memory). Then, only after the data is located in the other medium, may the requested data be manipulated at “byte-level” before being recompiled into blocks and copied back to the persistent storage (236).
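
By way of a non-limiting illustration, the block-level access pattern described above may be sketched as a read-modify-write sequence. The BlockDevice class, the write_bytes helper, and the 512-byte block size are assumptions made solely for this example.

```python
BLOCK_SIZE = 512  # assumed block size, for illustration only


class BlockDevice:
    """Hypothetical device that can only be read or written one whole block at a time."""

    def __init__(self, num_blocks: int):
        self._blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def read_block(self, index: int) -> bytes:
        return self._blocks[index]

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block long")
        self._blocks[index] = bytes(data)


def write_bytes(device: BlockDevice, offset: int, payload: bytes) -> None:
    """Byte-level update built from block-level operations."""
    if not payload:
        return
    first = offset // BLOCK_SIZE
    last = (offset + len(payload) - 1) // BLOCK_SIZE
    # 1. Copy every affected block into byte-addressable memory.
    buffer = bytearray(b"".join(device.read_block(i) for i in range(first, last + 1)))
    # 2. Manipulate the requested bytes at byte level.
    start = offset - first * BLOCK_SIZE
    buffer[start:start + len(payload)] = payload
    # 3. Recompile the buffer into blocks and copy them back.
    for n, i in enumerate(range(first, last + 1)):
        device.write_block(i, bytes(buffer[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE]))


device = BlockDevice(num_blocks=4)
write_bytes(device, offset=510, payload=b"abcd")  # touches blocks 0 and 1
```

In this sketch, changing four bytes requires reading and rewriting two full blocks, including the non-requested bytes that share those blocks.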


Accordingly, as used herein, “persistent storage”, “persistent storage device”, “block storage”, “block device”, and “block storage device” refer to hardware storage devices that are capable of being accessed only at a “block-level” regardless of whether that device is volatile, non-volatile, persistent, non-persistent, sequential access, random access, solid-state, or disk based. Further, as used herein, the term “block semantics” refers to the methods and commands software employs to access persistent storage (236).


Examples of “persistent storage” (236) include, but are not limited to, certain integrated circuit storage devices (e.g., solid-state drive (SSD), etc.), magnetic storage (e.g., hard disk drive (HDD), floppy disk, tape, diskette, etc.), optical media (e.g., compact disc (CD), digital versatile disc (DVD), etc.), NVMe devices, and computational storage. In one embodiment of the invention, an NVMe device is persistent storage that includes an SSD that is accessed using the NVMe® specification (which defines how applications communicate with an SSD via a peripheral component interconnect express (PCIe) bus). In one embodiment of the invention, computational storage is persistent storage that includes persistent storage media and microprocessors with domain-specific functionality to efficiently perform specific tasks on the data being stored in the storage device such as encryption and compression.


In one or more embodiments of the invention, the hardware layer (210) includes memory (238). In one embodiment of the invention, memory (238), similar to persistent storage (236), may be one or more hardware devices capable of storing digital information (e.g., data) in a non-transitory medium. However, unlike persistent storage (236), in one or more embodiments of the invention, when accessing memory (238), other components of client application node (200) are capable of reading and writing data at the smallest units of data normally accessible (e.g., “bytes”).


Specifically, in one or more embodiments of the invention, memory (238) may include a unique physical address for each byte stored thereon, thereby enabling software (e.g., applications (212), containers (202, 206)) to access and manipulate data stored in memory (238) by directing commands to a physical address of memory (238) that is associated with a byte of data (e.g., via a virtual-to-physical address mapping). Accordingly, in one or more embodiments of the invention, software is able to perform direct, “byte-level” manipulation of data stored in memory (unlike data in persistent storage, for which “blocks” of data must first be copied to another, intermediary storage medium prior to reading and/or manipulating the data located thereon).


Accordingly, as used herein, “memory”, “memory device”, “memory storage”, “memory storage device”, and “byte storage device” refer to hardware storage devices that are capable of being accessed and/or manipulated at a “byte-level” regardless of whether that device is volatile, non-volatile, persistent, non-persistent, sequential access, random access, solid-state, or disk based. As used herein, the terms “byte semantics” and “memory semantics” refer to the methods and commands software employs to access memory (238).


Examples of memory (238) include, but are not limited to, certain integrated circuit storage (e.g., flash memory, random access memory (RAM), dynamic RAM (DRAM), resistive RAM (ReRAM), etc.) and Persistent Memory (PMEM). PMEM is a solid-state high-performance byte-addressable memory device that resides on the memory bus, where the location of the PMEM on the memory bus allows PMEM to have DRAM-like access to data, which means that it has nearly the same speed and latency of DRAM and the non-volatility of NAND flash.


In one embodiment of the invention, the hardware layer (210) includes a memory management unit (MMU) (not shown). In one or more embodiments of the invention, an MMU is hardware configured to translate virtual addresses (e.g., those of a virtual address space (220)) to physical addresses (e.g., those of memory (238)). In one embodiment of the invention, an MMU is operatively connected to memory (238) and is the sole path to access any memory device (e.g., memory (238)) as all commands and data destined for memory (238) must first traverse the MMU prior to accessing memory (238). In one or more embodiments of the invention, an MMU may be configured to handle memory protection (allowing only certain applications to access memory) and provide cache control and bus arbitration. Further, in one or more embodiments of the invention, an MMU may include a translation lookaside buffer (TLB) (as described below).
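
By way of a non-limiting illustration, the virtual-to-physical translation performed by an MMU, together with a small translation cache, may be modeled in software as follows. The Mmu class, the 4096-byte page size, the two-entry TLB, and the oldest-entry eviction policy are assumptions made solely for this example and do not describe any particular hardware.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size, for illustration only


class Mmu:
    """Hypothetical software model of an MMU with a small TLB."""

    def __init__(self, page_table: Dict[int, int], tlb_entries: int = 2):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.tlb: Dict[int, int] = {}  # cache of recently used translations
        self.tlb_entries = tlb_entries
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, virtual_address: int) -> int:
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
            frame = self.tlb[vpn]
        else:
            self.tlb_misses += 1
            if vpn not in self.page_table:
                raise MemoryError(f"page fault at {virtual_address:#x}")
            frame = self.page_table[vpn]
            if len(self.tlb) >= self.tlb_entries:
                # Evict the oldest entry (dicts preserve insertion order).
                self.tlb.pop(next(iter(self.tlb)))
            self.tlb[vpn] = frame
        return frame * PAGE_SIZE + offset


mmu = Mmu({0: 5, 1: 9})
first = mmu.translate(4100)   # miss: consults the page table
second = mmu.translate(4200)  # hit: same virtual page as the previous access
```

In this sketch, the second access to the same virtual page is served from the cached translation without consulting the page table again.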


In one embodiment of the invention, the hardware layer (210) includes one or more graphics processing units (GPUs) (244). In one embodiment of the invention, the GPUs (244) are a type of processor that includes a significantly larger number of cores than the processors discussed above. The GPUs (244) may utilize the cores to perform a large number of processes in parallel. The processes performed by the GPUs may include basic arithmetic operations. The GPUs may perform additional types of processes without departing from the invention.


In one or more embodiments of the invention, the GPUs include computing resources that allow the GPUs to perform the functions described throughout this application. The computing resources may include cache, GPU memory (e.g., dynamic random access memory (DRAM)), and the cores discussed above. The cores may be capable of processing one or more threads at a time and temporarily storing data in the cache and/or local memory during the processing. A thread is a process performed on data by a core of the GPUs.


While FIGS. 2A-2B show a specific configuration of a client application node, other configurations may be used without departing from the scope of the disclosure. Accordingly, embodiments disclosed herein should not be limited to the configuration of devices and/or components shown in FIGS. 2A-2B.



FIG. 3 shows an example of a metadata node in accordance with one or more embodiments of the invention. In one embodiment of the invention, metadata node (300) includes a metadata server (302), a file system (304), a memory hypervisor module (306), an OS (not shown), a communication interface(s) (308), persistent storage (310), and memory (312). Each of these components is described below. In one or more embodiments of the invention, the metadata node (300) (or one or more components therein) is configured to perform all, or a portion, of the functionality described in FIGS. 6-7.


In one embodiment of the invention, the metadata server (302) includes functionality to manage all or a portion of the metadata associated with the CSI. The metadata server (302) also includes functionality to service requests for data layouts that it receives from the various client application nodes. Said another way, each metadata node may support multiple client application nodes. As part of this support, the client application nodes may send data layout requests to the metadata node (300). Metadata node (300), in conjunction with the file system (304), generates and/or obtains the requested data layouts and provides the data layouts to the appropriate client application nodes. The data layouts provide a mapping between file offsets and [SOV, offset]s (see e.g., FIGS. 5A-5B).


In one embodiment of the invention, the file system (304) includes functionality to manage a sparse virtual space (see e.g., FIG. 5B, 510) as well as the mapping between the sparse virtual space and an underlying SOV(s) (see e.g., FIG. 5B, 520). The file system (304), the metadata server (302), or another component in the metadata node (300) manages the mappings between the SOV(s) and the underlying storage media in the storage pool. Additional detail about the sparse virtual space and the SOV(s) is provided below with respect to FIGS. 5A-5B.


In one embodiment of the invention, the memory hypervisor module (306) is substantially the same as the memory hypervisor module described in FIG. 2B (e.g., 242).


In one embodiment of the invention, the metadata node (300) includes one or more communication interfaces (308). The communication interfaces are substantially the same as the communication interfaces described in FIG. 2A (e.g., 232).


In one embodiment of the invention, metadata node (300) includes one or more processor(s) (not shown). In one embodiment of the invention, a processor may be an integrated circuit(s) for processing instructions (e.g., those of the metadata server (302), file system (304) and/or those received via a communication interface(s) (308)). In one embodiment of the invention, processor(s) may be one or more processor cores or processor micro-cores. Further, in one or more embodiments of the invention, one or more processor(s) may include a cache (not shown) (as described).


In one or more embodiments of the invention, the metadata node includes persistent storage (310), which is substantially the same as the persistent storage described in FIG. 2A (e.g., 236).


In one or more embodiments of the invention, the metadata node includes memory (312), which is substantially similar to memory described in FIG. 2A (e.g., 238).



FIG. 4 shows an example of a storage node in accordance with one or more embodiments of the invention. In one embodiment of the invention, server node (400) includes a storage server (402), an OS (not shown), a communication interface(s) (404), persistent storage (406), and memory (408). Each of these components is described below. In one or more embodiments of the invention, the server node (400) (or one or more components therein) is configured to perform all, or a portion, of the functionality described in FIGS. 6-7.


In one embodiment of the invention, the storage server (402) includes functionality to manage the memory (408) and persistent storage (406) within the storage node.


In one embodiment of the invention, the server node includes communication interface(s) (404), which is substantially the same as the communication interface(s) described in FIG. 2A (e.g., 232).


In one embodiment of the invention, server node (400) includes one or more processor(s) (not shown). In one embodiment of the invention, a processor may be an integrated circuit(s) for processing instructions (e.g., those of the storage server (402), and/or those received via a communication interface (404)). In one embodiment of the invention, processor(s) may be one or more processor cores or processor micro-cores. Further, in one or more embodiments of the invention, one or more processor(s) may include a cache (not shown) (as described).


In one or more embodiments of the invention, the server node includes persistent storage (406), which is substantially the same as the persistent storage described in FIG. 2A (e.g., 236).


In one or more embodiments of the invention, the server node includes memory (408), which is substantially similar to memory described in FIG. 2A (e.g., 238).



FIGS. 5A-5B show relationships between various physical and virtual elements in the system in accordance with one or more embodiments of the invention. More specifically, FIGS. 5A-5B show the mappings that are maintained by the various nodes in the CSI in order to permit applications to read and/or write data in storage media in a storage pool.


Referring to FIG. 5A, applications (e.g., 212) executing in the application containers (e.g., 202) read and write from a virtual address space (500). The OS (e.g., 208) provides a mapping between offsets in the virtual address space (500) to corresponding logical blocks (e.g., logical block A, logical block B, logical block C) arranged in a file layout (502). Said another way, the OS maps segments of a virtual address space into a “file,” where a virtual address space segment (i.e., a portion of the virtual address space) (not shown) is mapped to a file offset (i.e., an offset in a file defined by the file layout (502)).


When the OS (e.g., 208) interacts with the FS client (e.g., 240), it uses the file name (or file identifier) and offset to refer to a specific location from which the application (e.g., 212) is attempting to read or write. The FS client (e.g., 240) maps the logical blocks (e.g., logical block A, logical block B, logical block C) (which are specified using [file name, offset]) to corresponding file system blocks (FSBs) (e.g., FSB1, FSB2, FSB3). The FSBs that correspond to a given file layout (502) may be referred to as file system layout (504). In one embodiment of the invention, the file layout (502) typically includes a contiguous set of logical blocks, while the file system layout (504) typically includes a set of FSBs, which may or may not be contiguous FSBs. The mapping between the file layout (502) and the file system layout (504) is generated by the metadata server (see e.g., FIGS. 6-7).
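
As a minimal sketch of the mapping described above, the following Python fragment models a file layout of contiguous logical blocks that is mapped to FSBs that need not be contiguous. The logical block size, the FSB numbers, and the names `FILE_SYSTEM_LAYOUT` and `fsb_for` are hypothetical and are chosen only for illustration.

```python
LOGICAL_BLOCK_SIZE = 4096  # assumed logical block size in bytes

# Hypothetical file system layout: (file name, logical block index) -> FSB number.
# Logical blocks A, B, and C are contiguous (indices 0, 1, 2); their FSBs are not.
FILE_SYSTEM_LAYOUT = {
    ("file_a", 0): 17,  # logical block A
    ("file_a", 1): 3,   # logical block B
    ("file_a", 2): 42,  # logical block C
}


def fsb_for(file_name: str, offset: int) -> int:
    """Return the FSB backing the logical block that contains `offset`."""
    block_index = offset // LOGICAL_BLOCK_SIZE
    return FILE_SYSTEM_LAYOUT[(file_name, block_index)]
```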


Referring to FIG. 5B, the FSBs (e.g., FSB 1 (516), FSB N (518)) correspond to FSBs in a sparse virtual space (510). In one embodiment of the invention, the sparse virtual space (510) is a sparse, virtual data structure that provides a comprehensive layout and mapping of data managed by the file system (e.g., FIG. 3, 304) in the metadata node. Thus, while there may be multiple virtual address space(s) (e.g., virtual address space (500)) and there may be multiple SOVs (520) there is only one sparse virtual space (510).


In one embodiment of the invention, the sparse virtual space (510) may be allocated with several petabytes of sparse space, with the intention being that the aggregate space of the storage media in the storage pool (532) will not exceed several petabytes of physical storage space. Said another way, the sparse virtual space (510) is sized to support an arbitrary number of virtual address spaces and an arbitrary amount of storage media such that the size of the sparse virtual space (510) remains constant after it has been initialized.


The sparse virtual space (510) may be logically divided into a metadata portion (512) and a data portion (514). The metadata portion (512) is allocated for the storage of file system metadata and FS client metadata. The file system metadata and the FS client metadata may correspond to any metadata (examples of which are provided below with respect to FIGS. 6-7) to enable (or that enables) the file system and the FS client to implement one or more embodiments of the invention. The data portion (514) is allocated for the storage of data that is generated by applications (e.g., 212) executing on the client application nodes (e.g., 200). Each of the aforementioned portions may include any number of FSBs (e.g., 516, 518).


In one or more embodiments of the invention, each FSB may be uniformly sized throughout the sparse virtual space (510). In one or more embodiments of the invention, each FSB may be equal to the largest unit of storage in storage media in the storage pool. Alternatively, in one or more embodiments of the invention, each FSB may be allocated to be sufficiently larger than any current and future unit of storage in storage media in the storage pool.


In one or more embodiments of the invention, one or more SOVs (e.g., 520) are mapped to FSBs in the sparse virtual space (510) to ultimately link the FSBs to storage media. More specifically, each SOV is a virtual data space that is mapped to corresponding physical regions of a portion of, one, or several storage devices, which may include one or more memory devices and one or more persistent storage devices. The SOV(s) (e.g., 520) may identify physical regions of the aforementioned devices by maintaining a virtual mapping to the physical addresses of data that comprise those memory devices (e.g., 238, 312, 408) or persistent storage devices (e.g., 236, 310, 406).


In one or more embodiments of the invention, several SOVs may concurrently exist (see e.g., FIG. 15A), each of which is independently mapped to part of, one, or several memory devices. Alternatively, in one embodiment of the invention, there may only be a SOV associated with the physical regions of all devices in a given node (e.g., a client application node, a metadata node, or a storage node).


In one embodiment of the invention, a SOV may be uniquely associated with a single storage device (e.g., a memory device or a persistent storage device). Accordingly, a single SOV may provide a one-to-one virtual emulation of a single storage device of the hardware layer. Alternatively, in one or more embodiments of the invention, a single SOV may be associated with multiple storage devices (e.g., a memory device or a persistent storage device), each sharing some characteristic. For example, there may be a single SOV for two or more DRAM devices and a second SOV for two or more PMEM devices. One of ordinary skill in the art, having the benefit of this detailed description, would appreciate that SOV(s) (e.g., 520) may be organized by any suitable characteristic of the underlying memory (e.g., based on individual size, collective size, type, speed, etc.).


In one embodiment of the invention, storage pool (532) includes one or more storage devices (e.g., memory devices and/or persistent storage devices). The storage devices (or portions thereof) may be mapped into the SOV in “slice” units (or “slices”). For example, each slice (e.g., 522, 524, 526, 528, 530) may have a size of 256 MB (the invention is not limited to this example). When mapped into the SOV, each slice may include a contiguous set of FSBs that have an aggregate size equal to the size of the slice. Accordingly, each of the aforementioned FSBs (e.g., 516, 518) is logically associated with a slice (e.g., 522, 524, 526, 528, 530) in the SOV. The portion of the slice that is mapped to a given FSB may be specified by an offset within a SOV (or by an offset within a slice within the SOV). Each portion of the slice within a SOV is mapped to one or more physical locations in the storage pool. In one non-limiting example, the portion of client C (256) may be 4K in size and may be stored in the storage pool (532) as a 6K stripe with four 1K data chunks (e.g., chunk w (534), chunk x (536), chunk y (538), chunk z (540)) and two 1K parity chunks (e.g., chunk P (542), chunk Q (544)). In one embodiment of the invention, slices that only include FSBs from the metadata portion are referred to as metadata slices and slices that only include FSBs from the data portion are referred to as data slices.
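
The 6K stripe in the non-limiting example above can be illustrated with the following sketch, which splits a 4K portion into four 1K data chunks and derives one parity chunk. The description does not state how the parity chunks are computed; byte-wise XOR for chunk P is an assumption made here for illustration, and chunk Q (which would need a second, independent code) is not modeled. All function names are hypothetical.

```python
from functools import reduce

CHUNK_SIZE = 1024  # 1K chunks, as in the non-limiting example


def split_into_data_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split `data` into equally sized data chunks (e.g., chunks w, x, y, z)."""
    if len(data) % chunk_size != 0:
        raise ValueError("data length must be a multiple of the chunk size")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def xor_parity(chunks: list) -> bytes:
    """Byte-wise XOR of equally sized chunks (assumed scheme for chunk P)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# 4K of sample data standing in for the portion being stored.
portion = bytes((i * 7 + i // CHUNK_SIZE) % 256 for i in range(4 * CHUNK_SIZE))
data_chunks = split_into_data_chunks(portion)        # four 1K data chunks
chunk_p = xor_parity(data_chunks)                    # one 1K parity chunk
stripe_size = (len(data_chunks) + 2) * CHUNK_SIZE    # 4 data + 2 parity chunks
```

Under the XOR assumption, any single lost data chunk can be rebuilt by XOR-ing chunk P with the three surviving data chunks.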


Using the relationships shown in FIGS. 5A-5B, a logical block (e.g., logical block A, logical block B, logical block C) in a file layout (502) (which may be specified as a [file, offset, length]) is mapped to an FSB (e.g., 516, 518), the FSB (e.g., 516, 518) is mapped to a location in the SOV (520) (which may be specified as a [SOV, offset, length]), and the location in the SOV (520) is ultimately mapped to one or more physical locations (e.g., 534, 536, 538, 540, 542, 544) in a storage media (e.g., memory devices) within a storage pool (532).
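
The chain of mappings summarized above may be sketched as a sequence of table lookups. The FSB size, the table contents, and the `resolve` helper below are hypothetical; only the 256 MB slice size is taken from the example in the text, and the final mapping from a location in the SOV to physical locations in the storage pool is omitted.

```python
FSB_SIZE = 4 * 1024              # assumed FSB size in bytes (illustrative)
SLICE_SIZE = 256 * 1024 * 1024   # slice size from the example in the text

# Hypothetical tables, one per mapping layer.
file_to_fsb = {("file_a", 0): 516}                              # [file, offset] -> FSB
fsb_to_sov = {516: ("sov_520", 3 * SLICE_SIZE + 8 * FSB_SIZE)}  # FSB -> [SOV, offset]


def resolve(file_name: str, offset: int) -> dict:
    """Follow a [file, offset] down to a slice and an offset within that slice."""
    fsb = file_to_fsb[(file_name, offset)]
    sov, sov_offset = fsb_to_sov[fsb]
    # Slices are fixed-size, so the slice is found by integer division.
    slice_index, offset_in_slice = divmod(sov_offset, SLICE_SIZE)
    return {
        "fsb": fsb,
        "sov": sov,
        "sov_offset": sov_offset,
        "slice_index": slice_index,
        "offset_in_slice": offset_in_slice,
    }
```

With these assumed sizes, each slice holds `SLICE_SIZE // FSB_SIZE` contiguous FSBs.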


Using the aforementioned architecture, the available storage media in the storage pool may increase or decrease in size (as needed) without impacting how the application (e.g., 212) is interacting with the sparse virtual space (510). More specifically, by creating a layer of abstraction between the sparse virtual space (510) and the storage pool (532) using the SOV (520), the sparse virtual space (510) continues to provide FSBs to the applications provided that these FSBs are mapped to a SOV without having to manage the mappings to the underlying storage pool. Further, by utilizing the SOV (520), changes made to the storage pool including how data is protected in the storage pool are performed in a manner that is transparent to the sparse virtual space (510). This enables the size of the storage pool to scale to an arbitrary size (up to the size limit of the sparse virtual space) without modifying the operation of the sparse virtual space (510).



FIG. 6 shows a flowchart of a method of generating and servicing a mapping request in accordance with one or more embodiments of the invention. All or a portion of the method shown in FIG. 6 may be performed by the client application node and/or the metadata node. Another component of the system may perform this method without departing from the invention. While the various steps in this flowchart are presented and described sequentially, one of ordinary skill in the relevant art will appreciate that some or all of the steps may be executed in different orders, may be combined or omitted, and some or all steps may be executed in parallel.


The method shown in FIG. 6 may be performed whenever an application (212) in a client application container (e.g., 202) triggers a page fault. In one embodiment of the invention, a page fault is issued by a processor when an invalid reference is provided to an MMU. Specifically, when a request (initiated by the application) to access or modify memory is sent to the MMU, using a virtual address, the MMU may perform a lookup in a TLB to find a physical address associated with the provided virtual address (e.g., a virtual-to-physical address mapping). However, if the TLB does not provide a physical address associated with the virtual address (e.g., due to the TLB lacking the appropriate virtual-to-physical address mapping), the MMU will be unable to perform the requested operation. Accordingly, the MMU informs the processor that the request cannot be serviced, and in turn, the processor issues a page fault back to the OS informing that the request could not be serviced.
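
The lookup that leads to a page fault can be sketched as follows. This is a simplified software model, not a description of any particular MMU: the page size, the dictionary used as a TLB, and the `PageFault` exception are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when no virtual-to-physical mapping exists for an address."""

    def __init__(self, virtual_address: int):
        super().__init__(f"no mapping for virtual address {virtual_address:#x}")
        self.virtual_address = virtual_address


def translate(tlb: dict, virtual_address: int) -> int:
    """Translate a virtual address using a TLB modeled as {virtual page: physical page}."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in tlb:
        # The request cannot be serviced; the fault carries the virtual address.
        raise PageFault(virtual_address)
    return tlb[page] * PAGE_SIZE + offset
```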


A page fault typically specifies the virtual address (i.e., an address in virtual address space (e.g., 220)). The page fault may specify other information depending on whether the page fault was triggered by a read, write, or mapping request.


In one or more embodiments of the invention, as described in FIG. 2A above, the kernel module is software executing in the OS that monitors data traversing the OS and may intercept, modify, and/or otherwise alter that data based on one or more conditions. In one embodiment of the invention, the kernel module is capable of redirecting data received by the OS by intercepting and modifying that data to specify a recipient different than normally specified by the OS.


In one or more embodiments of the invention, the OS will, initially, be configured to forward the page fault to the application from which the request originated. However, in one embodiment of the invention, the kernel module detects that the OS received a page fault and forwards the page fault to a different location (i.e., the client FS container) instead of the default recipient (i.e., the application container and/or application). In one embodiment of the invention, the kernel module specifically monitors for and detects exception handling processes that specify an application's inability to access the physical location of data.


Turning to FIG. 6, in step 600, the client FS container receives a request from a kernel module to resolve a page fault, where the request specifies at least one [file, offset] corresponding to the virtual address from the virtual address space of the application. Said another way, the virtual address associated with the page fault is translated into a [file, offset]. The [file, offset] is then sent to the client FS container.
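
One way to picture the translation of a faulting virtual address into a [file, offset] is a search over the virtual address space segments that are mapped to files. The `Segment` record and the `to_file_offset` helper below are hypothetical and serve only as a sketch of the idea.

```python
from typing import NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    start: int        # first virtual address of the segment
    length: int       # size of the segment in bytes
    file_name: str    # file to which the segment is mapped
    file_offset: int  # file offset corresponding to `start`


def to_file_offset(segments, virtual_address: int) -> Optional[Tuple[str, int]]:
    """Return the [file, offset] for a virtual address, or None if unmapped."""
    for seg in segments:
        if seg.start <= virtual_address < seg.start + seg.length:
            return seg.file_name, seg.file_offset + (virtual_address - seg.start)
    return None
```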


In step 602, the FS container sends a request to a metadata node to obtain a data layout associated with the [file, offset]. The request for the data layout may also specify that the request is for read only access or for read write access. In one embodiment of the invention, read only access indicates that the application only wants to read data from a physical location associated with the virtual address while read write access indicates that the application wants to read data from and/or write data to a physical location associated with the virtual address. From the perspective of the application, the physical location is a local physical location (i.e., a physical location in the memory or the persistent storage) on the client application node; however, as shown in FIGS. 5A-5B, the physical location is actually a physical location in the storage pool.


In one embodiment of the invention, each FS client (e.g., 240) is associated with a single file system (e.g., 304) (however, each file system may be associated with multiple FS clients). The request in step 602 is sent to the metadata node that hosts the file system that is associated with the FS client on the client application node (i.e., the client application node on which the page fault was generated).


In step 604, the metadata node receives the request from the FS client container.


In step 606, in response to the request, the metadata server (on the metadata node) identifies one or more FSBs in the sparse virtual space. The identified FSBs correspond to FSBs that are allocatable. An FSB is deemed allocatable if: (i) the FSB is mapped to the SOV and (ii) the FSB has not already been allocated. Condition (i) is required because while the sparse virtual space includes a large collection of FSBs, by design, at any given time not all of these FSBs are necessarily associated with any SOV(s). Accordingly, only FSBs that are associated with a SOV at the time step 606 is performed may be allocated. Condition (ii) is required as the sparse virtual space is designed to support applications distributed across multiple clients and, as such, one or more FSBs that are available for allocation may have been previously allocated by another application. The FSBs identified in step 606 may be denoted as pre-allocated FSBs in the event that no application has written any data to these FSBs.
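
The two conditions for an allocatable FSB translate directly into a predicate. The `FSB` record and the helper names below are hypothetical; the sketch shows only the check described above, not how the metadata server actually stores its state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FSB:
    number: int
    sov: Optional[str] = None  # SOV to which the FSB is mapped, if any
    allocated: bool = False


def is_allocatable(fsb: FSB) -> bool:
    """Condition (i): mapped to a SOV. Condition (ii): not already allocated."""
    return fsb.sov is not None and not fsb.allocated


def identify_allocatable(fsbs: List[FSB], count: int) -> List[FSB]:
    """Pick `count` allocatable FSBs and mark them as allocated."""
    picked = [fsb for fsb in fsbs if is_allocatable(fsb)][:count]
    if len(picked) < count:
        raise RuntimeError("not enough allocatable FSBs")
    for fsb in picked:
        fsb.allocated = True
    return picked
```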


In one embodiment of the invention, the FSBs identified in step 606 may not be sequential (or contiguous) FSBs in the sparse virtual space. In one or more embodiments of the invention, more than one FSB may be allocated (or pre-allocated) for each logical block. For example, consider a scenario in which each logical block is 8K and each FSB is 4K. In this scenario, two FSBs are allocated (or pre-allocated) for each logical block. The FSBs that are associated with the same logical block may be sequential (or contiguous) FSBs within the sparse virtual space.


In step 608, after the FSB(s) has been allocated (or pre-allocated as the case may be), the metadata server generates a data layout. The data layout provides a mapping between the [file, file offset] (which was included in the request received in step 600) and a [SOV, offset]. The data layout may include one or more of the aforementioned mappings between [file, file offset] and [SOV, offset]. Further, the data layout may also specify the one or more FSBs associated with the data layout.


In one embodiment of the invention, if the request in step 602 specifies read only access, then the data layout will include [file, file offset] to [SOV, offset] mappings for the FSBs that include the data that the application (in the client application node) is attempting to read. In one embodiment of the invention, if the request in step 602 specifies read write access, then the data layout may include one set of [file, file offset] to [SOV, offset] mappings for the FSBs that include the data that the application (in the client application node) is attempting to read and a second set of [file, file offset] to [SOV, offset] mappings for the FSBs to which the application may write data. The dual set of mappings provided in the aforementioned data layout may be used to support redirected writes, i.e., the application does not overwrite data; rather, all new writes are directed to new FSBs.
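
A data layout with one or two sets of mappings might be represented as in the sketch below. The dictionary structure, the access strings, and the `build_data_layout` helper are assumptions made for illustration and are not taken from the description.

```python
def build_data_layout(file_name, file_offset, access, current, fresh=None):
    """Build a data layout for one [file, file offset].

    `current` is the [SOV, offset] holding the existing data; `fresh` is the
    [SOV, offset] of a newly allocated FSB to which redirected writes go.
    """
    key = (file_name, file_offset)
    layout = {"access": access, "read": {key: current}}
    if access == "read_write":
        if fresh is None:
            raise ValueError("read write access needs a target for new writes")
        # Second set of mappings: new writes never overwrite `current`.
        layout["write"] = {key: fresh}
    return layout
```

Keeping the read and write mappings separate is what allows existing data to remain readable at its original location while new data lands in a different FSB.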


Continuing with the discussion of FIG. 6, in step 610, the data layout is sent to the FS client container. The metadata server may track which client application nodes have requested which data layouts. Further, if the request received in step 600 specified read write access, the metadata server may prevent any other client application from accessing the FSBs associated with the data layout generated in Step 608.
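
The bookkeeping described above could be sketched as follows. The `LayoutTracker` class is hypothetical, and the policy it encodes (deny other clients once one client holds read write access to an FSB) is one possible reading of the paragraph rather than a statement of how the metadata server is implemented.

```python
class LayoutTracker:
    """Tracks which client application nodes hold data layouts for which FSBs."""

    def __init__(self):
        self.holders = {}  # FSB -> set of client node ids holding a layout
        self.writers = {}  # FSB -> client node id holding read write access

    def grant(self, client: str, fsb: int, read_write: bool) -> bool:
        """Record a layout request; return False if another writer blocks it."""
        writer = self.writers.get(fsb)
        if writer is not None and writer != client:
            return False  # another client already holds read write access
        if read_write:
            self.writers[fsb] = client
        self.holders.setdefault(fsb, set()).add(client)
        return True
```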


In step 612, the client application node receives and caches the data layout from the metadata node. The FS client may also create an association between the logical blocks in the file layout (e.g., 502) and the corresponding FSBs in the file system layout (e.g., 504) based on the data layout.


In one embodiment of the invention, the FS client allocates an appropriate amount of local memory (e.g., local DRAM, local PMEM), which is/will be used to temporarily store data prior to it being committed to (i.e., stored in) the storage pool using the received data layout. Further, if the request that triggered the page fault (see step 600) was a read request, then the FS client may further initiate the reading of the requested data from the appropriate location(s) in the storage pool (e.g., via the memory hypervisor module) and store the obtained data in the aforementioned local memory.


In step 614, the client FS container informs the OS (or kernel module in the OS) of the virtual-to-physical address mapping. The virtual-to-physical address mapping is a mapping of a location in the virtual address space and a physical address in the local memory (as allocated in step 612). Once the aforementioned mapping is provided, the application and/or OS may directly manipulate the local memory of the client application node (i.e., without processing from the client FS container).


While FIG. 6 describes the allocation of local memory (e.g., 238), in scenarios in which the application is GPU-aware (i.e., the application is interacting with the GPU memory of the GPU (e.g., 244)), the virtual-to-physical address mapping is a mapping of a location in the virtual address space and a physical address in the GPU memory. The allocation of the GPU memory is managed by the GPU module (246) in the OS (208). In addition to managing the virtual-to-physical address mapping for the GPU memory, the GPU module also registers the physical address specified in the virtual-to-physical address mapping with the RDMA engine (i.e., the RDMA engine on the client application node). The registration allows the RDMA engine to directly access the data in the GPU memory and transmit it to the appropriate storage node(s) via the communication fabric.



FIG. 7 shows a flowchart of a method of servicing a write request in accordance with one or more embodiments of the invention. All or a portion of the method shown in FIG. 7 may be performed by the client application node. Another component of the system may perform this method without departing from the invention. While the various steps in this flowchart are presented and described sequentially, one of ordinary skill in the relevant art will appreciate that some or all of the steps may be executed in different orders, may be combined or omitted, and some or all steps may be executed in parallel.


The method shown in FIG. 7 may be performed whenever an application (e.g., 212) in the application container wants to write data. More specifically, once the method shown in FIG. 6 has been performed, the application may directly read and write data to the GPU memory of a client application node, which is then written via steps 700-712 to the storage pool. Further, for the data to be persisted, the data must be stored in the storage pool and the corresponding metadata must be stored in the metadata node (see e.g., steps 714-716). Steps 700-712, which relate to the storage of the data in the storage pool, may be initiated by the client application, the OS, or the client FS container. The client application may initiate the storage of the data as part of an msync or fflush command, while the OS and client FS container may initiate the storage of the data as part of their management of the local resources on the client application node.


If the application has initiated the storage of the data using an msync or fflush command, then steps 700-716 are performed, resulting in the data being persisted. In this scenario, the data is written to storage as a first part of processing the msync or fflush command, and then the metadata (including the data layout) is stored on the metadata server as the second part of processing the msync or fflush command.


However, if the OS or client FS container initiates the storage of the data, then the corresponding metadata may or may not be committed (i.e., steps 714 and 716 may not be performed). In certain scenarios, steps 714-716 may be initiated by the OS or the client FS container and performed by the client FS container as part of the OS or client FS container managing the local resources (e.g., portions of the cache used to store the data layouts need to be freed to store other data layouts).


In step 700, a request to write data (i.e., write data to the storage pool; however, the metadata may or may not be committed, see e.g., step 714) is received by the client FS container from the OS. The request may specify a virtual address corresponding to the location of the data in GPU memory and a [file, offset]. As discussed above, the writing of data may also be initiated by the OS and/or the client FS container without departing from the invention. In such embodiments, the request is initiated by the OS and/or another process in the client FS container and the process that initiated the request provides the [file, offset] to the FS client.


In step 702, the FS client obtains the data layout required to service the request. The data layout may be obtained using the [file, offset] in the request received from the OS. The data layout may be obtained from a cache on the client application node. However, if the data layout is not present on the client application node, e.g., because it was invalidated and, thus, removed from the client application node, then the data layout is obtained from the metadata node in accordance with FIG. 6, steps 602-612.
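
Step 702 follows a familiar cache-with-fallback pattern, sketched below. The `obtain_data_layout` helper and the `fetch_from_metadata_node` callable are hypothetical; the callable stands in for the exchange with the metadata node described in FIG. 6, steps 602-612.

```python
def obtain_data_layout(cache: dict, key, fetch_from_metadata_node):
    """Return the data layout for `key`, a (file, offset) pair.

    The local cache is consulted first; on a miss (e.g., the layout was
    invalidated and removed), the layout is fetched and cached again.
    """
    layout = cache.get(key)
    if layout is None:
        layout = fetch_from_metadata_node(key)
        cache[key] = layout
    return layout
```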


In step 704, the FS client, using the data layout, obtains the SOV offset. As discussed above, the data layout provides a mapping between file offsets (e.g., offsets within a file layout (e.g., 502)) and the [SOV, offset]s in a SOV (e.g., 520). Accordingly, the FS client translates the [file, offset] into [SOV, offset].


In step 706, the memory hypervisor module issues a translation request to the GPU module, where the translation request specifies the virtual address (i.e., the virtual address specified in the write request in step 700).


In step 708, the memory hypervisor module receives a translation response from the GPU module that includes a physical address in the GPU memory, which corresponds to the virtual address. The GPU module is configured to receive the translation request, perform a look-up in the virtual-to-physical address mapping, and provide the resulting physical address in the translation response.
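
Steps 706 and 708 amount to a request and response exchange, which the following sketch models with plain dictionaries. The `GpuModule` class, its method names, and the message fields are all hypothetical stand-ins; no real GPU driver interface is implied.

```python
class GpuModule:
    """Hypothetical stand-in for the GPU module (246)."""

    def __init__(self):
        self._mapping = {}  # virtual address -> physical address in GPU memory

    def register(self, virtual_address: int, physical_address: int) -> None:
        self._mapping[virtual_address] = physical_address

    def handle_translation_request(self, request: dict) -> dict:
        """Look up the virtual address and build a translation response."""
        physical = self._mapping.get(request["virtual_address"])
        if physical is None:
            return {"status": "unmapped", "physical_address": None}
        return {"status": "ok", "physical_address": physical}


def translate_for_write(gpu_module: GpuModule, virtual_address: int) -> int:
    """Steps 706-708 as seen by the memory hypervisor module (sketch)."""
    response = gpu_module.handle_translation_request(
        {"virtual_address": virtual_address})
    if response["status"] != "ok":
        raise LookupError(f"no GPU mapping for {virtual_address:#x}")
    return response["physical_address"]
```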


In step 710, the [SOV, offset] is then provided to the memory hypervisor module to process. More specifically, the memory hypervisor module includes the information necessary to generate and issue one or more I/O requests that result in the data being written directly from the GPU memory on the client application node (e.g., via a communication interface(s)) to an appropriate location in the storage pool. For example, if the application is attempting to write data associated with logical block A (e.g., [File A, offset 0]), then the memory hypervisor module is provided with [SOV, offset 18] (which is determined using the obtained data layout). The memory hypervisor module includes the necessary information to enable it to generate, in this example, one or more I/O requests to specific locations in the storage pool. Said another way, the memory hypervisor module includes functionality to: (i) determine how many I/O requests to generate to store the data associated with [SOV, offset 18]; (ii) divide the data into an appropriate number of chunks (i.e., one chunk per I/O request); (iii) determine the target of each I/O request (the physical location in the storage pool at which the chunk will be stored); and (iv) issue the I/O requests directly to the nodes on which the aforementioned physical locations exist. The issuance of the I/O requests includes initiating the transfer of data from the appropriate location in the GPU memory to the target location specified in the I/O request.
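
Functions (i) through (iii) above can be sketched as follows. The `IORequest` record, the chunk size, and the `targets` list (assumed to be the already resolved physical locations for the [SOV, offset]) are hypothetical; how the memory hypervisor module actually determines the targets is not modeled, and issuing the requests (function (iv)) is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IORequest:
    target_node: str           # node hosting the physical location
    target_address: int        # physical location in the storage pool
    gpu_physical_address: int  # source of the chunk in GPU memory
    length: int                # chunk length in bytes


def generate_io_requests(gpu_physical_address: int, length: int,
                         targets: List[Tuple[str, int]],
                         chunk_size: int) -> List[IORequest]:
    """Build one I/O request per chunk of the data held in GPU memory."""
    count = -(-length // chunk_size)  # (i) number of requests (ceiling division)
    if len(targets) < count:
        raise ValueError("not enough target locations for the data")
    requests = []
    for i in range(count):
        start = i * chunk_size                        # (ii) chunk boundaries
        node, address = targets[i]                    # (iii) target of the chunk
        requests.append(IORequest(node, address,
                                  gpu_physical_address + start,
                                  min(chunk_size, length - start)))
    return requests
```

Each request carries a GPU physical address rather than the data itself, which mirrors the idea that the transfer is made directly from GPU memory to the target location.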


The communication interface(s) in the client application node facilitates the direct transfer of the data from the client application node to the appropriate location in the storage pool. As discussed above, the storage pool may include storage media located in storage devices (e.g., memory devices or persistent storage devices) that may be on client application nodes, metadata nodes, and/or storage nodes. Accordingly, for any given I/O request, the communication interface(s) on the client application node on which the data resides transmits the data directly to communication interface(s) of the target node (i.e., the node that includes the storage media on which the data is to be written).


In step 712, the client application node awaits confirmation from the target node(s) that the I/O request(s) generated and issued in step 710 has been successfully stored on the target node(s). At the end of step 712, the data has been written to the storage pool; however, the corresponding metadata is not persisted at this point; as such, the data is not deemed to be persisted. Specifically, if the application does not subsequently issue an msync command (e.g., when the application is using memory semantics) or an fflush command (e.g., when the application is using file semantics), the data will be stored in the storage pool but the metadata server will not be aware that such data has been stored. In order to persist the data, steps 714 and 716 are performed. If steps 700-710 were initiated by the OS or the client FS container, then the process may end at step 712 as the data was only written to the storage pool to free local resources (e.g., memory) on the client application node and there is no need at this time to persist the data (i.e., perform steps 714-716). Further, in scenarios in which the OS initiated the writing of the data, then step 712 also includes the client FS container notifying the OS that the data has been written to the storage pool. However, as discussed below, there may be scenarios in which the data needs to be persisted at this time and, as such, steps 714-716 are performed.
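The distinction drawn above between data that has been written and data that is deemed persisted may be illustrated with a small state tracker. This is a minimal sketch only; the names WriteState, PendingWrite, on_confirmation, and on_layout_committed are hypothetical and do not appear in this description.

```python
from enum import Enum
from typing import Iterable


class WriteState(Enum):
    IN_FLIGHT = "in_flight"   # step 710 issued, confirmations outstanding
    WRITTEN = "written"       # end of step 712: data is in the storage pool
    PERSISTED = "persisted"   # after step 714: metadata node knows the layout


class PendingWrite:
    """Tracks one write through steps 712 and 714 (illustrative only)."""

    def __init__(self, request_ids: Iterable[str]) -> None:
        self._outstanding = set(request_ids)
        self.state = WriteState.IN_FLIGHT

    def on_confirmation(self, request_id: str) -> None:
        # A target node confirmed that one I/O request was stored.
        self._outstanding.discard(request_id)
        if not self._outstanding and self.state is WriteState.IN_FLIGHT:
            self.state = WriteState.WRITTEN

    def on_layout_committed(self) -> None:
        # The data layout may only be committed once all data is written.
        if self.state is not WriteState.WRITTEN:
            raise RuntimeError("data layout committed before data was written")
        self.state = WriteState.PERSISTED


write = PendingWrite(["req-1", "req-2"])
write.on_confirmation("req-1")
state_after_first = write.state
write.on_confirmation("req-2")
state_after_all = write.state
write.on_layout_committed()
```

Under this sketch, a write that stops at WriteState.WRITTEN corresponds to the case in which the process ends at step 712: the data occupies the storage pool, but the metadata has not been committed.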


Specifically, the data (and associated metadata) may be persisted as a result of: (i) the application issuing an msync command (e.g., when the application is using memory semantics) or an fflush command (e.g., when the application is using file semantics), (ii) the client FS container initiating (transparently to the application) steps 714 and 716, or (iii) the OS initiating (transparently to the application) steps 714 and 716.


If the application issues a request to commit data (e.g., issues an msync command or an fflush command), then in step 714, the client application node (in response to the confirmation in step 712) sends a request to commit the data layout to the metadata node. The commit request includes the mapping between the file layout and the file system layout (see e.g., FIG. 5A). Upon receipt of the commit request, the metadata server stores the mapping between the file layout and the file system layout. The processing of the commit request may also trigger the invalidation of prior versions of the data layout that are currently cached on other client application nodes. For example, if client application node A requested a data layout with read only access for a [file, offset] corresponding to FSB A and client application node B subsequently requested a data layout with read write access also for FSB A, then once client application node B performs the method in FIG. 7, the data layout on client application node A is invalidated (e.g., based on a command issued by the metadata server) so as to force client application node A to obtain an updated data layout, which then ensures that client application node A is reading the updated version of the data associated with FSB A. The process then proceeds to step 716.
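The commit and invalidation behavior described above may be sketched as follows. This is an illustrative model under stated assumptions, not the metadata node's actual implementation: the class MetadataNode, its methods grant_layout and commit, and the use of in-memory dictionaries are all hypothetical, and access modes (read only versus read write) are omitted for brevity.

```python
from typing import Dict, List, Set, Tuple

FileKey = Tuple[str, int]      # [file, offset]
SovLocation = Tuple[str, int]  # [SOV, offset]


class MetadataNode:
    """Toy model of layout commit and invalidation (illustrative only)."""

    def __init__(self) -> None:
        self.mappings: Dict[FileKey, SovLocation] = {}
        # Which client application nodes currently cache a layout per key.
        self.layout_holders: Dict[FileKey, Set[str]] = {}

    def grant_layout(self, client: str, key: FileKey) -> None:
        self.layout_holders.setdefault(key, set()).add(client)

    def commit(self, client: str, key: FileKey,
               location: SovLocation) -> List[str]:
        """Store the mapping and return the clients to be invalidated."""
        self.mappings[key] = location
        holders = self.layout_holders.get(key, set())
        stale = sorted(holders - {client})
        # Only the committing client still holds a current layout.
        self.layout_holders[key] = {client}
        return stale


node = MetadataNode()
key = ("File A", 0)
node.grant_layout("client application node A", key)  # reader
node.grant_layout("client application node B", key)  # writer
to_invalidate = node.commit("client application node B", key, ("SOV", 18))
```

Here the commit by client application node B returns client application node A as the holder of a stale layout, which mirrors the example above in which node A is forced to obtain an updated data layout before reading again.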


In scenarios in which the OS or client FS container has previously committed the data layout to the metadata node, then when the client FS container receives a request to persist the data from the application, the client FS container confirms that it has previously committed the corresponding data layout (and other related metadata) (without issuing any request to the metadata nodes). After making this determination locally, the client FS container then proceeds to step 716.
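The local determination described in the preceding paragraph may be sketched as follows. The sketch assumes that the client FS container keeps a record of the data layouts it has already committed; the names ClientFSContainer, commit_layout, persist, and the send_commit callback are hypothetical.

```python
from typing import Callable, Set, Tuple

FileKey = Tuple[str, int]  # [file, offset]


class ClientFSContainer:
    """Skips the metadata node when a layout is already committed."""

    def __init__(self, send_commit: Callable[[FileKey], None]) -> None:
        self._send_commit = send_commit     # stands in for step 714
        self._committed: Set[FileKey] = set()

    def commit_layout(self, key: FileKey) -> None:
        # May be initiated by the OS or by the container itself.
        self._send_commit(key)
        self._committed.add(key)

    def persist(self, key: FileKey) -> bool:
        """Handle a persist request; return True if a commit was sent."""
        if key in self._committed:
            return False                    # determined locally
        self.commit_layout(key)
        return True


sent = []
container = ClientFSContainer(sent.append)
container.commit_layout(("File A", 0))            # earlier, OS-initiated
first = container.persist(("File A", 0))          # no new commit request
second = container.persist(("File B", 0))         # commit request needed
```

A fuller model would also clear the record for a [file, offset] whenever new data is written to it, so that a later persist request triggers a fresh commit; that bookkeeping is omitted here.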


Finally, in scenarios in which the OS or the client FS container needs to commit the corresponding metadata to the metadata server (e.g., portions of the cache used to store the data layouts need to be freed to store other data layouts), then steps 714 and 716 may be initiated by the OS or the client FS container and performed by the client FS container.


In step 716, the client FS container then notifies the OS that the data has been persisted. The OS may then send the appropriate confirmation and/or notification to the application that initiated the request to persist the data. The OS does not notify the application when FIG. 7 was initiated by the OS and/or the client FS container. Further, depending on the implementation, the client FS container may or may not notify the OS if steps 714 and 716 were initiated by the client FS container.


While one or more embodiments have been described herein with respect to a limited number of embodiments and examples, those skilled in the art, having benefit of this disclosure, would appreciate that other embodiments can be devised which do not depart from the scope of the embodiments disclosed herein. Accordingly, the scope should be limited only by the attached claims.

Claims
  • 1. A method for storing data, the method comprising: receiving, from an operating system (OS) and by a client file system (FS) container both executing on a client application node, a request to write data to a storage pool operatively connected to the client application node, wherein the client FS container comprises a memory hypervisor module; generating, by the memory hypervisor module, at least one input/output (I/O) request, wherein the at least one I/O request specifies a location in the storage pool and a physical address of the data in a graphics processing unit (GPU) memory in a GPU on the client application node, wherein the location is determined using a data layout, wherein the data layout provides a mapping between a [file, offset] and a [scale out volume (SOV), offset] and the SOV is associated with the storage pool, wherein the SOV provides a layer of abstraction between a sparse virtual space associated with the client application node and storage devices of the storage pool, wherein the storage devices are scalable in size without impacting interaction between the SOV and the client application node, and wherein the physical address is determined using a GPU module; and issuing, by the memory hypervisor module, the at least one I/O request to the storage pool, wherein processing the at least one I/O request results in at least a portion of the data being stored at the location, and wherein a commit of a data layout based on the I/O request is sent to a metadata node operatively connected to the client application node and the storage pool, and a previous data layout associated with the client application node on the metadata node is invalidated.
  • 2. The method of claim 1, wherein the data is associated with a [file, offset] and a virtual address of the sparse virtual space, wherein the GPU module provides a mapping between the virtual address and the physical address.
  • 3. The method of claim 2, wherein the request to write data to the storage pool is initiated by an application executing on the client application node, wherein the application is a GPU-aware application.
  • 4. The method of claim 3, wherein the GPU module registers the physical address with a remote direct memory access (RDMA) engine on the client application node prior to the application issuing the request to write data.
  • 5. The method of claim 1, wherein processing the at least one I/O request comprises issuing data directly from the GPU memory to a communication fabric using a remote direct memory access (RDMA) engine on the client application node, wherein the communication fabric operatively connects the client application node to the storage pool.
  • 6. The method of claim 1, wherein the location in the storage pool is a location in a memory device and wherein the memory device is a persistent memory (PMEM) device.
  • 7. The method of claim 1, wherein the GPU module executes in an operating system (OS) of the client application node.
  • 8. A non-transitory computer readable medium comprising instructions which, when executed by a processor, enable the processor to perform a method for storing data, the method comprising: receiving, from an operating system (OS) and by a client file system (FS) container both executing on a client application node, a request to write data to a storage pool operatively connected to the client application node, wherein the client FS container comprises a memory hypervisor module; generating, by the memory hypervisor module, at least one input/output (I/O) request, wherein the at least one I/O request specifies a location in the storage pool and a physical address of the data in a graphics processing unit (GPU) memory in a GPU on the client application node, wherein the location is determined using a data layout, wherein the data layout provides a mapping between a [file, offset] and a [scale out volume (SOV), offset] and the SOV is associated with the storage pool, wherein the SOV provides a layer of abstraction between a sparse virtual space associated with the client application node and storage devices of the storage pool, wherein the storage devices are scalable in size without impacting interaction between the SOV and the client application node, and wherein the physical address is determined using a GPU module; and issuing, by the memory hypervisor module, the at least one I/O request to the storage pool, wherein processing the at least one I/O request results in at least a portion of the data being stored at the location, and wherein a commit of a data layout based on the I/O request is sent to a metadata node operatively connected to the client application node and the storage pool, and a previous data layout associated with the client application node on the metadata node is invalidated.
  • 9. The non-transitory computer readable medium of claim 8, wherein the data is associated with a [file, offset] and a virtual address, wherein the GPU module provides a mapping between the virtual address and the physical address.
  • 10. The non-transitory computer readable medium of claim 9, wherein the request to write data to the storage pool is initiated by an application executing on the client application node, wherein the application is a GPU-aware application.
  • 11. The non-transitory computer readable medium of claim 10, wherein the GPU module registers the physical address with a remote direct memory access (RDMA) engine on the client application node prior to the application issuing the request to write data.
  • 12. The non-transitory computer readable medium of claim 8, wherein processing the at least one I/O request comprises issuing data directly from the GPU memory to a communication fabric using a remote direct memory access (RDMA) engine on the client application node, wherein the communication fabric operatively connects the client application node to the storage pool.
  • 13. The non-transitory computer readable medium of claim 8, wherein the location in the storage pool is a location in a memory device and wherein the memory device is a persistent memory (PMEM) device.
  • 14. The non-transitory computer readable medium of claim 8, wherein the GPU module executes in an operating system (OS) of the client application node.
  • 15. A client application node, comprising: a hardware processor; a graphics processing unit (GPU); an application container executing on the hardware processor and comprising an application; a client file system (FS) container executing on the hardware processor and comprising a memory hypervisor module; and an operating system (OS) executing on the hardware processor and comprising a GPU module, wherein the client FS container is configured to: receive, from the OS, a request to write data to a storage pool operatively connected to the client application node; wherein the memory hypervisor module is configured to: generate at least one input/output (I/O) request, wherein the at least one I/O request specifies a location in the storage pool and a physical address of data in a GPU memory in the GPU, wherein the location is determined using a data layout, wherein the data layout provides a mapping between a [file, offset] and a [scale out volume (SOV), offset] and the SOV is associated with the storage pool, wherein the SOV provides a layer of abstraction between a sparse virtual space associated with the client application node and storage devices of the storage pool, wherein the storage devices are scalable in size without impacting interaction between the SOV and the client application node, and wherein the physical address is determined using the GPU module; and issue the at least one I/O request to the storage pool, wherein processing the at least one I/O request results in at least a portion of the data being stored at the location, and wherein a commit of a data layout based on the I/O request is sent to a metadata node operatively connected to the client application node and the storage pool, and a previous data layout associated with the client application node on the metadata node is invalidated.
  • 16. The client application node of claim 15, wherein the data is associated with a [file, offset] and a virtual address, wherein the GPU module provides a mapping between the virtual address and the physical address.
  • 17. The client application node of claim 16, wherein the request to write data to the storage pool is initiated by the application, wherein the application is a GPU-aware application.
  • 18. The client application node of claim 17, wherein the GPU module registers the physical address with a remote direct memory access (RDMA) engine on the client application node prior to the application issuing the request to write data.
  • 19. The client application node of claim 15, wherein processing the at least one I/O request comprises issuing data directly from the GPU memory to a communication fabric using a remote direct memory access (RDMA) engine on the client application node, wherein the communication fabric operatively connects the client application node to the storage pool.
  • 20. The client application node of claim 15, wherein the location in the storage pool is a location in a memory device and wherein the memory device is a persistent memory (PMEM) device.
Related Publications (1)
Number Date Country
20230126511 A1 Apr 2023 US