This application claims the benefit under 35 USC § 119(a) of Korean Patent Application No. 10-2022-0173904 filed on Dec. 13, 2022, in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.
The following description relates to a computational network interface card (NIC) with metadata caching.
A network interface card (NIC) is a hardware device used to connect a computer to a network for communication. A NIC may also be referred to as a network interface controller, a local area network (LAN) card, a physical network interface, a network adapter, a network card, or an Ethernet card.
A NIC transmits data stored in a memory or cache to destinations outside of the NIC. To that end, a NIC may include a cache that serves as a buffer for storing information generated during the transmission/reception of data (e.g., during signal conversion).
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In one general aspect, a computational network interface card (NIC) sharing a storage space with a memory expander thereof includes: a NIC memory including instructions; a NIC processor including a cache and electrically connected to the NIC memory; and a NIC configured to transmit data stored in the NIC memory or the cache to a network, wherein the instructions are configured to cause the NIC processor to perform operations including: receiving a request to read or write metadata; and checking whether the requested metadata is stored in a local metadata cache of the computational NIC by sequentially checking whether the metadata is stored in the cache of the NIC processor, a cache of the NIC, the NIC memory, or the memory expander.
The metadata may be metadata used in a distributed file system, and a client device including the computational NIC may participate in the distributed file system.
The NIC may be a compute express link (CXL)-NIC that supports CXL.cache as defined in a CXL protocol.
The memory expander may be a CXL memory expander that supports CXL.mem as defined in a CXL protocol.
The computational NIC may be a type 1, type 2, or type 3 device as defined in a CXL protocol.
The NIC and the NIC processor may be connected in series, and a client device including the computational NIC may access the metadata via the NIC processor.
The NIC and the NIC processor may be connected in parallel, and a client device including the computational NIC may access the metadata without passing through the NIC processor.
The operations may further include: when the requested metadata is determined to be stored in the local metadata cache, verifying validity of the cached metadata; or when the requested metadata is determined to not be stored in the local metadata cache, requesting a metadata server for a metadata management permission of the requested metadata.
The verifying of the validity may include: verifying a validity period of the stored metadata; and verifying whether the computational NIC holds a management permission corresponding to the requested metadata work permission.
In a general aspect, an electronic device includes: a memory including instructions; a processor electrically connected to the memory and configured to execute the instructions; and a computational network interface card (NIC) configured to manage metadata work permissions for distributed file system (DFS) metadata in a local DFS metadata cache of the computational NIC, wherein the DFS metadata is metadata for a DFS, and wherein the instructions are configured to cause the processor and/or the computational NIC to perform operations including: requesting, to the computational NIC, by the processor, metadata work permission for a piece of DFS metadata required by a first process executing on the processor; based on determining that the computational NIC does not have metadata management permission for the piece of DFS metadata, requesting, by the computational NIC, to a metadata server of the DFS, metadata management permission for the piece of DFS metadata; and obtaining, by the computational NIC, from the metadata server, metadata management permission for the piece of DFS metadata, and based thereon, assigning metadata work permission to the first process.
The metadata work permission may be a metadata read permission or a metadata write permission, and the metadata management permission from the metadata server grants permission to assign DFS metadata read permission or write permission to the process executed by the processor.
The metadata server may be configured to manage the metadata work permission for all metadata of at least a portion of the DFS, and the computational NIC may be configured to be entrusted, by the metadata server, with metadata management permission for at least some of the DFS metadata in the local DFS metadata cache.
The computational NIC may be configured to: based on the obtained metadata management permission, assign the metadata read permission to both the first process and a second process executing on the processor.
When another electronic device different from the electronic device obtains the metadata write permission for the piece of DFS metadata, the metadata work permission for the piece of DFS metadata may be surrendered by the electronic device.
The computational NIC may include: a NIC memory including instructions; a NIC processor including a cache and electrically connected to the NIC memory; and a NIC configured to transmit data stored in the NIC memory or the cache to a network connected to the NIC.
In another general aspect, a memory expander supporting CXL.mem as defined in a compute express link (CXL) protocol includes: a memory device in which pieces of DFS metadata of a distributed file system (DFS) are stored; and a controller configured to control the memory device, wherein the controller is configured to: receive a work request for an operation on one of the pieces of DFS metadata from an electronic device including a computational network interface card (NIC) connected with the memory expander; and in response to the work request, perform at least one operation on the one of the pieces of DFS metadata stored in the memory device.
The controller may be further configured to: communicate directly with the computational NIC or the electronic device through a CXL interface.
The computational NIC may be a type 2 or type 3 device as defined in the CXL protocol, and may include a NIC memory supporting CXL.mem as defined in the CXL protocol.
The computational NIC may be further configured to: assign, to a process executed by the electronic device, a metadata work permission to work on a piece of DFS metadata, wherein the memory expander may be configured to: receive the work request from the electronic device executing the process.
The controller may be further configured to: receive, from the electronic device, a metadata read request for reading the one of the pieces of DFS metadata; and in response to the metadata read request, transmit the one of the pieces of DFS metadata to the electronic device.
The controller may be further configured to: receive, from the electronic device, a metadata write request for writing the one of the pieces of DFS metadata and associated result data; and in response to the metadata write request, change the one of the pieces of DFS metadata to the result data and store the result data.
The pieces of DFS metadata may be metadata used for accessing filesystem objects in a distributed file system.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
Throughout the drawings and the detailed description, unless otherwise described or provided, the same or like drawing reference numerals will be understood to refer to the same or like elements, features, and structures. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.
The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the disclosure of this application. For example, the sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent after an understanding of the disclosure of this application, with the exception of operations necessarily occurring in a certain order. Also, descriptions of features that are known after an understanding of the disclosure of this application may be omitted for increased clarity and conciseness.
The features described herein may be embodied in different forms and are not to be construed as being limited to the examples described herein. Rather, the examples described herein have been provided merely to illustrate some of the many possible ways of implementing the methods, apparatuses, and/or systems described herein that will be apparent after an understanding of the disclosure of this application.
The terminology used herein is for describing various examples only and is not to be used to limit the disclosure. The articles “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. As used herein, the term “and/or” includes any one and any combination of any two or more of the associated listed items. As non-limiting examples, terms “comprise” or “comprises,” “include” or “includes,” and “have” or “has” specify the presence of stated features, numbers, operations, members, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, operations, members, elements, and/or combinations thereof.
Throughout the specification, when a component or element is described as being “connected to,” “coupled to,” or “joined to” another component or element, it may be directly “connected to,” “coupled to,” or “joined to” the other component or element, or there may reasonably be one or more other components or elements intervening therebetween. When a component or element is described as being “directly connected to,” “directly coupled to,” or “directly joined to” another component or element, there can be no other elements intervening therebetween. Likewise, expressions, for example, “between” and “immediately between” and “adjacent to” and “immediately adjacent to” may also be construed as described in the foregoing.
Although terms such as “first,” “second,” and “third”, or A, B, (a), (b), and the like may be used herein to describe various members, components, regions, layers, or sections, these members, components, regions, layers, or sections are not to be limited by these terms. Each of these terminologies is not used to define an essence, order, or sequence of corresponding members, components, regions, layers, or sections, for example, but used merely to distinguish the corresponding members, components, regions, layers, or sections from other members, components, regions, layers, or sections. Thus, a first member, component, region, layer, or section referred to in the examples described herein may also be referred to as a second member, component, region, layer, or section without departing from the teachings of the examples.
Unless otherwise defined, all terms, including technical and scientific terms, used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains and based on an understanding of the disclosure of the present application. Terms, such as those defined in commonly used dictionaries, are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the disclosure of the present application and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein. The use of the term “may” herein with respect to an example or embodiment, e.g., as to what an example or embodiment may include or implement, means that at least one example or embodiment exists where such a feature is included or implemented, while all examples are not limited thereto.
A high-performance computing system may include multiple computer system nodes connected in parallel to perform billions of operations per second. For such a high-performance computing system, a distributed file system 10 in which multiple computer system nodes are connected through a parallel input/output (I/O) path may be used. Clients and servers of the high-performance computing system may be equipped with NICs for network communication, and the NICs may be connected to each other through a high-performance network for a data center, for example, an InfiniBand or high-bandwidth Ethernet network. The distributed file system 10 may provide high-speed access to extremely large and/or numerous files to support high-performance computing as used, for example, in weather modeling and seismic prospecting. However, the distributed file system 10 is not limited to use with high-performance applications. In some implementations, the distributed file system 10 may be a variation of a Lustre distributed file system.
Referring to
The metadata servers 11 may include metadata targets in which the metadata is stored. The object storage servers 12 may include object storage targets in which file data is stored. Management servers 13 may manage the metadata servers 11 and the object storage servers 12. The management servers 13 may include a management target (MGT) in which data associated with management of the metadata servers 11 and the object storage servers 12 is stored.
Referring to
While the performance of distributed file systems such as the example distributed file system 10 can be improved by increasing the performance of the object storage servers 12, most operations on file data in previous distributed file systems involve corresponding accesses to the metadata servers 11 (e.g., access to a metadata server by a client device). Even if the performance of the object storage servers 12 is increased, there may be bottlenecks caused by the need to access the metadata servers 11 (generally, for each file request). The performance of the metadata servers 11 may be increased to, for example, meet service level agreements (SLAs) when client devices create overload (e.g., burst demand) on the distributed file system 10. The need for improved metadata-server performance may be partly addressed through server expansion (adding servers), but server expansion requires additional cost for server installation and maintenance/repair, and often, such additional resources for demand spikes go under-utilized at other times.
An electronic device 100 (e.g., a client device) may use local metadata caching to reduce the latency of its distributed file system (DFS) requests (latency that would otherwise be incurred by accessing the metadata servers 20) by instead accessing locally cached metadata for DFS requests, which may also reduce load on the metadata servers 20.
The electronic device 100 may provide locally cached DFS metadata through the use of a computational network interface card (NIC) 200 and possibly other components, such as, for example, a compute express link (CXL) memory expander 300. The computational NIC 200 may also be referred to as a computational network interface controller or a smart NIC. The electronic device 100 may increase its local metadata caching capacity by using memory of the CXL memory expander 300 and/or the computational NIC 200 for its local metadata cache.
The electronic device 100 may be connected to the metadata server 20 through a network (e.g., an InfiniBand network). In the electronic device 100, the computational NIC 200 may manage local permissions for its locally cached metadata, for example, permission for local processes to work on (modify, access, etc.) the locally cached metadata. The computational NIC 200 may manage the permissions for the cached metadata in a local storage area, and the metadata server 20 may manage the permissions for metadata in a global area/scope. Accordingly, data coherency (or integrity) of the metadata may be maintained efficiently.
As noted, the electronic device 100 may function as a client device in a distributed file system. The electronic device 100 may be a personal computer (PC), a portable device, an application server, or a storage server, to name some examples. A portable device may be implemented as, for example, a laptop computer, a mobile phone, a smartphone, a tablet PC, a mobile Internet device (MID), a personal digital assistant (PDA), an enterprise digital assistant (EDA), a digital still camera, a digital video camera, a portable multimedia player (PMP), a personal (or portable) navigation device (PND), a handheld game console, an e-book, or a smart device. A smart device may be implemented as, for example, a smartwatch, a smart band, or a smart ring.
Still referring to
The processor 110 may process data stored in the memory 120. The processor 110 may execute computer-readable code (e.g., software) stored in the memory 120 and instructions triggered by the processor 110. The processor 110 may execute an operating system kernel that manages resources of the electronic device 100. Processes/threads managed by the operating system kernel may execute on the processor 110 and may make requests to access resources (e.g., file system objects) of the distributed file system 10. Metadata may be requested directly or indirectly as a result (e.g., by a DFS module) of requesting file system objects.
The processor 110 may be a hardware-implemented data processing device with a physically structured circuit to execute desired operations. The desired operations may include, for example, code or instructions included in a program. The hardware-implemented data processing device may include, for example, a microprocessor, a central processing unit (CPU), a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or combinations thereof.
The memory 120 may be implemented as a volatile memory device or a non-volatile memory device. A volatile memory device may be implemented as, for example, a dynamic random-access memory (DRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), a zero capacitor RAM (Z-RAM), or a twin transistor RAM (TTRAM). A non-volatile memory device may be implemented as, for example, an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic RAM (MRAM), a spin-transfer torque MRAM (STT-MRAM), a conductive bridging RAM (CBRAM), a ferroelectric RAM (FeRAM), a phase change RAM (PRAM), a resistive RAM (RRAM), a nanotube RRAM, a polymer RAM (PoRAM), a nano floating gate Memory (NFGM), a holographic memory, a molecular electronic memory device, or an insulator resistance change memory.
The computational NIC 200 may also be referred to as a smart NIC or a data processing unit. The computational NIC 200 may have one or more processors that may perform functions throughout the network stack (even as high as the application layer, depending on implementation). The computational NIC 200 may be configured for offloading network processing from the processor 110 (a host processor) to a processor of the computational NIC 200 (although such use is not required). The computational NIC 200 may perform an operation (e.g., managing metadata) through a processor thereof (e.g., a NIC processor 210 of
The CXL memory expander 300 may be configured in a similar form to that of a solid-state drive (SSD) and may be an external storage device. The CXL memory expander 300 may be provided at a position where an SSD is provided and may contribute to the expansion of DRAM capacity. The CXL memory expander 300 may be used for caching DFS metadata. The computational NIC 200 and the CXL memory expander 300 are described with reference to
A computational NIC 200 may provide a connection (e.g., to another client device and/or a server) for an electronic device (e.g., the electronic device 100 of
Referring to
The NIC processor 210 may be configured as a CPU that consumes relatively low power, for example, an FPGA or a reduced instruction set computer (RISC) microprocessor. The NIC processor 210 may manage DFS metadata cached in the NIC memory 220, the CXL-NIC 230, a cache 240, and/or the CXL memory expander 300. To avoid, at a small cost, the latency of accessing a metadata server (e.g., the metadata server 20), the NIC processor 210 may be configured to manage local caching of DFS metadata (in some implementations, for improved performance, the NIC processor 210 may be reserved for only/primarily managing local DFS caching).
The NIC memory 220, the CXL-NIC 230, the cache 240, and/or the CXL memory expander 300 may function as a local cache for storing DFS metadata.
The CXL-NIC 230 may support CXL.cache as defined in a CXL protocol. The CXL-NIC 230 may be a type 1 device as defined in a CXL protocol. The CXL-NIC 230 may include a cache (e.g., a cache 250 of
The cache 240 may be a memory that serves as an intermediate buffer to improve the speed of communication between the NIC memory 220 and the NIC processor 210 (e.g., a Level-1 cache). The cache 240 may have a faster data processing/access speed than the NIC memory 220. The NIC processor 210 may perform fast processing of DFS requests by storing and accessing, in the cache 240, some of the data (e.g., DFS metadata) stored in the NIC memory 220.
The CXL memory expander 300 may support CXL.mem as defined in a CXL protocol.
The CXL memory expander 300 may be a type 3 device as defined in a CXL protocol. The CXL memory expander 300 may be directly connected to the electronic device 100 and the computational NIC 200 through a CXL interface.
The CXL memory expander 300 may include a memory device 310 and a controller 320.
The memory device 310 may store data (e.g., DFS metadata). The controller 320 may control writing of data (e.g., DFS metadata) in the memory device 310 and reading of data (e.g., DFS metadata) stored in the memory device 310.
The controller 320 may receive a DFS metadata work/operation request which is a request for working/operating on metadata; the request may come from the electronic device 100 including the computational NIC 200. The computational NIC 200 may assign, to a process on the electronic device 100 (e.g., an operating system process), a metadata work permission for working on local DFS metadata. The controller 320 may receive the metadata work request from the process. The metadata described herein may be DFS metadata used in a distributed file system and the metadata work request may be a read or write request.
In response to the received work request, the controller 320 may perform a corresponding operation on the DFS metadata. For example, in response to a read request for metadata, the controller 320 may transmit the metadata corresponding to the read request to the electronic device 100 (e.g., based on the electronic device or its process having a metadata read permission). For example, in response to a metadata write request, the controller 320 may update the stored DFS metadata with result data included with the request. Writing DFS metadata to a local cache may affect cache coherency. Operations for maintaining the coherency of metadata stored in the CXL memory expander 300 (or another local memory) are described with reference to
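For purposes of illustration only, the read/write handling just described may be sketched in pseudocode as follows. This is a minimal sketch, not an actual CXL memory-expander interface; the request format, the dictionary standing in for the memory device 310, and all names are illustrative assumptions.

```python
# Hedged sketch of the controller 320 dispatching metadata work requests.
# The request format and the dict standing in for memory device 310 are
# illustrative assumptions, not an actual CXL memory-expander interface.

class ExpanderController:
    def __init__(self):
        self.store = {}     # stands in for the memory device 310

    def handle_work_request(self, request):
        """request: dict with 'op' ('read' or 'write'), 'key', and, for
        writes, 'result' (the result data to store)."""
        if request["op"] == "read":
            # Transmit the requested piece of DFS metadata to the requester.
            return self.store.get(request["key"])
        if request["op"] == "write":
            # Update the stored DFS metadata with the supplied result data.
            self.store[request["key"]] = request["result"]
            return request["result"]
        raise ValueError(f"unsupported operation: {request['op']}")
```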
The CXL memory expander 300 may expand storage space of the computational NIC 200, and by extension, the electronic device 100. The electronic device 100 including the CXL memory expander 300 may simultaneously use a storage space of a host memory (e.g., the memory 120 of
In addition, caching DFS metadata in the CXL memory expander 300 may facilitate access to the metadata by the electronic device 100. Moreover, using the metadata stored in the CXL memory expander 300 may reduce the load on the metadata servers 20, particularly if many clients are configured for local caching of metadata.
The computational NIC 200 (that is handling local DFS requests and providing local metadata caching) may provide an immediate response to a DFS request of the electronic device 100 (e.g., a request relating to or requiring DFS metadata). When receiving a request relating to metadata (e.g., a metadata work request or a file request), the computational NIC 200 may sequentially check its local cache storage: whether the metadata is stored in (i) the cache 240, (ii) a cache of the CXL-NIC 230 (e.g., a cache 250), (iii) the NIC memory 220, or (iv) the CXL memory expander 300.
When metadata requested by the electronic device 100 is determined to be stored (available) in the local DFS metadata cache (e.g., any one or more of cache 240, the cache 250 of the CXL-NIC 230, the NIC memory 220, or the CXL memory expander 300), such a case may be referred to as a “cache hit.” When metadata requested by the electronic device 100 is not stored in the local DFS metadata cache, such a case may be referred to as a “cache miss.”
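For illustration, the sequential hit/miss check described above may be sketched as a short loop over the local storage tiers, from fastest to slowest. The tier objects and their get() helper are illustrative assumptions, not part of any NIC or CXL API.

```python
# Hedged sketch of the sequential local-cache check: tiers are consulted
# from fastest to slowest, and the first hit is served locally. The tier
# objects and their get() helper are illustrative assumptions.

def lookup_local_metadata(key, tiers):
    # tiers: e.g. [cache_240, cache_250_of_cxl_nic, nic_memory_220, expander_300]
    for tier in tiers:
        metadata = tier.get(key)
        if metadata is not None:
            return metadata   # cache hit: respond without the metadata server
    return None               # cache miss: the metadata server must be consulted
```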
To increase a cache hit probability of the computational NIC 200, storage capacity of the local metadata cache may be increased. To do so, the computational NIC 200 may use a CXL device (e.g., the CXL-NIC 230 or the CXL memory expander 300) supporting the CXL protocol. CXL devices are described next.
A CXL device that supports a CXL protocol (CXL.cache, CXL.mem, and CXL.io) may be classified as a type 1, type 2, or type 3 device according to the supported parts of the CXL protocol. A CXL device that supports CXL.cache through a CXL interface may be a type 1 device. A CXL device that supports CXL.mem through the CXL interface may be a type 3 device. A CXL device that supports both CXL.cache and CXL.mem through the CXL interface may be a type 2 device.
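For reference, this classification reduces to a simple mapping, sketched below with illustrative names (the CXL specification, not this sketch, defines the device types).

```python
# Sketch of the CXL device-type classification described above.
# All CXL devices also support CXL.io; the type depends on which of
# CXL.cache and CXL.mem the device additionally supports.

def cxl_device_type(supports_cache: bool, supports_mem: bool) -> int:
    if supports_cache and supports_mem:
        return 2   # type 2: CXL.cache + CXL.mem (e.g., an accelerator with memory)
    if supports_cache:
        return 1   # type 1: CXL.cache only (e.g., the CXL-NIC 230)
    if supports_mem:
        return 3   # type 3: CXL.mem only (e.g., the CXL memory expander 300)
    raise ValueError("a CXL device supports at least one of CXL.cache or CXL.mem")
```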
Referring to the first example 410 of
Referring to the second example 420 of
Referring to
The computational NIC 200 may simultaneously support CXL.mem and CXL.cache as defined in a CXL protocol. The computational NIC 200 may be a type 2 device as defined in the CXL protocol.
The computational NIC 200 may receive a request (e.g., a host request) for metadata from the processor 110 through the CXL interface. Alternatively, the computational NIC 200 may receive a file/object request and may itself generate a corresponding request for DFS metadata. The computational NIC 200 may check whether the DFS metadata is locally cached.
When the metadata is locally cached, the computational NIC 200 may respond quickly to the request of the processor 110 by bypassing access to the metadata server 20. The computational NIC 200 may first check whether the metadata is stored (e.g., cached) in a cache-type memory (e.g., the cache 240 or the cache 250) because the cache-type memory may have a faster data processing speed than the other memory included in the local DFS metadata cache (e.g., the NIC memory 220 and the CXL memory expander 300). When the metadata is determined to be not stored in the cache-type (fast) memory of the local DFS metadata cache, the computational NIC 200 may check whether the requested/needed DFS metadata is stored in the ordinary (slower) memory of the local DFS metadata cache (e.g., the NIC memory 220 or the CXL memory expander 300).
When the requested DFS metadata is determined to be not stored in the local DFS metadata cache, the computational NIC 200 may respond to the request of the processor 110 by accessing the metadata server 20. Operations performed by the computational NIC 200 are described with reference to
The computational NIC 200 may be arranged with a serial or parallel connection structure according to a type of connection between the NIC processor 210 and a NIC (e.g., the CXL-NIC 230 or a NIC 260). According to the type of connection between the NIC processor 210 and the NIC (e.g., the CXL-NIC 230 or the NIC 260), a path through which a client device (e.g., the electronic device 100 of
As described below, a structure of the computational NIC 200 may have any of six types of structures (or others). Although the structure of a computational NIC 200 may be classified in further detail according to the characteristics (e.g., private ownership and coherence) between a cache included in the NIC processor 210 and a cache included in the NIC (e.g., the CXL-NIC 230 or the NIC 260), such further features are not considered herein for classification. Accordingly, the structure of the computational NIC 200 is not limited to the examples of
Referring to the first example 610 shown in
Referring to the second example 620 of
Referring to the third example 630 of
Referring to the fourth example 640 of
Referring to the fifth example 650 of
Referring to the sixth example 660 of
When a device (e.g., the electronic device 100 of
Referring to
The metadata server may apply a mechanism for granting and revoking metadata management permissions for metadata (e.g., metadata A) to and from the computational NICs (e.g., NIC processor 1 of the first computational NIC and NIC processor 2 of the second computational NIC). Permission may be in the form of a transferrable security token. For example, when the metadata (e.g., metadata A) stored in the metadata server is stored (e.g., cached) in the first local cache 803 of the first computational NIC, the metadata server may transfer (e.g., assign) a token granting local management permission to the first computational NIC (security tokens may have lifespans and expire). While the local management permission obtained by the first computational NIC remains valid, the first electronic device 801 (e.g., a client device) including the first computational NIC may use its locally stored copy of the metadata (e.g., the cached metadata A) without accessing the metadata server.
As discussed next, the nature of the management permission can vary with the type of operation to be performed. In the case of writing, the metadata server may grant only one write-permission token to any of the clients for a piece of metadata (i.e., only one client has permission to locally update that piece of metadata, e.g., metadata A). The client holder of the write-permission token may have exclusive permission to update the corresponding piece of metadata in its local metadata cache. The write-permission token can be pulled back from the current client/holder (revoked) to the metadata server and then the token (or another) may be provided to another client when that other client needs write permission for the corresponding piece of metadata. In the case of reading, the metadata server may grant multiple read-permission tokens to respective clients for a piece of metadata (e.g., metadata A), and the clients with such tokens may manage local reading permission for the relevant metadata (e.g., metadata A) in their corresponding local metadata caches. When needed, e.g., to help maintain cache coherency in the case of a write, the metadata server may revoke any or all of the read-permission tokens. Write permission may be considered “complete” permission in that it also carries read permission. The metadata server(s) may track which tokens/permissions have been granted to which clients/electronic devices for which pieces of metadata, which may enable recall/revocation of same.
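For purposes of illustration, the single-writer/multiple-reader token bookkeeping described above may be sketched as follows. All class and method names are hypothetical assumptions; an actual DFS lock manager would differ in detail (e.g., client notification and token expiry are elided here).

```python
# Hedged sketch of the metadata server's token bookkeeping: at most one
# write token per piece of metadata, any number of read tokens, and
# revocation of conflicting tokens before a new grant. Illustrative only.

class MetadataTokenServer:
    def __init__(self):
        self.readers = {}   # metadata key -> set of client ids holding read tokens
        self.writer = {}    # metadata key -> client id holding the single write token

    def grant_read(self, key, client):
        # Multiple clients may hold read tokens for the same metadata.
        self.readers.setdefault(key, set()).add(client)

    def grant_write(self, key, client):
        # Pull back every conflicting token first, so only one writer exists.
        for reader in self.readers.pop(key, set()):
            if reader != client:
                self.revoke(key, reader)
        previous = self.writer.get(key)
        if previous is not None and previous != client:
            self.revoke(key, previous)
        self.writer[key] = client   # the write token also implies read permission

    def revoke(self, key, client):
        # In a real system this would notify the client's computational NIC,
        # which would in turn revoke permissions from its local processes.
        pass
```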
When the second electronic device 802 (e.g., a client device), for example, is to perform an update operation on (locally update) its local copy of metadata (e.g., its copy of metadata A) for which the write-permission management is currently held by the first computational NIC per its possession of a corresponding write-permission token, the metadata server may revoke (e.g., deactivate) the write-management permission assigned to the first computational NIC by pulling back its write-permission token. The metadata server may then transfer (e.g., assign) the write-permission management for metadata A to the second computational NIC by sending it the same (or another) write-permission token for metadata A. Through this mechanism, data coherency (integrity) may be guaranteed in a situation where multiple electronic devices (e.g., the first and second electronic devices 801 and 802) are accessing the same metadata (e.g., metadata A). The metadata server may collectively manage, in this fashion, the work permission for all metadata in the global area. A computational NIC may be assigned (e.g., entrusted) with the local management permission for some of the metadata in its local DFS metadata cache area.
As noted, operations by a client on metadata may include reading and writing. As described next, parallel reading may be performed by a plurality of electronic devices (e.g., processes/threads of different clients) on one piece of metadata by allowing those electronic devices to each hold read-permission for that piece of metadata at the same time. As noted, writing may be performed by only allowing one electronic device (e.g., one process) at a time to hold write permission for a corresponding piece of metadata.
For example, the first electronic device 801 may request the first computational NIC for a read permission for metadata A required for process 1-1. Such request may be responsive to, or a part of, a request to read the file system object (e.g., a file) associated with metadata A. When the first computational NIC determines that it does not currently have read-permission for metadata A (e.g., permission to assign read permission for metadata A to local threads/processes), the first computational NIC may request the metadata server for read-permission management for metadata A and obtain the management permission (e.g., a read-permission token). The metadata server may make a record of the granted permission and provide the permission. The first computational NIC may then assign read permission for metadata A to its local process 1-1 (e.g., the first electronic device 801 performing process 1-1).
In a case in which the same metadata (e.g., metadata A) is requested when the first electronic device 801 is executing process 1-2 (i.e., when the first electronic device 801 is executing process 1-1 and process 1-2 simultaneously), the first electronic device 801 may request the first computational NIC for the read permission for metadata A required for process 1-2. The first computational NIC, having the read-permission management for metadata A (e.g., the permission to locally assign the read permission), may assign the read permission for metadata A to process 1-2 (e.g., the first electronic device 801 performing process 1-2) without communicating with the metadata server. The first electronic device 801 may simultaneously execute process 1-1 and process 1-2 and allow them to locally access metadata A, without requiring communication with the metadata server, which may thus reduce locking overhead.
In a case in which the same metadata (e.g., metadata A) is requested when the second electronic device 802 is executing process 2-1 (which may be executing in parallel with process 1-1 and process 1-2), the second electronic device 802 may request the second computational NIC for the read permission for metadata A required for process 2-1. The second computational NIC may determine that it does not hold read-permission management for metadata A (e.g., permission to locally assign the read permission), and thus the second computational NIC may request the metadata server for the read-permission management for metadata A and obtain it (e.g., a read-permission token). The metadata server may make a record of the permission granted and provide the same. The second computational NIC may then assign the read permission for metadata A to process 2-1 (e.g., the second electronic device 802 performing process 2-1). As described above, reading may be performed simultaneously by multiple processes (executing on multiple electronic devices), whereas writing may be performed only by a single process (of a single electronic device) at a time. Note that in the case of a client/device holding write permission for a piece of metadata, that client/device may freely shift the write permission among its local processes (but only one such local process at a time is granted write permission). In addition, a client/device holding the exclusive write permission also has read permission.
In a case in which write permission for the same metadata (e.g., metadata A) is requested when the second electronic device 802 is executing process 2-2 (while process 1-1, process 1-2, and process 2-1 are also executing), the second electronic device 802 may request the second computational NIC for the write permission for metadata A required for process 2-2. The second computational NIC may have only limited management permission (e.g., permission to locally assign only read permission), and thus the second computational NIC may request the metadata server for write-permission management (e.g., permission to assign the write permission) for metadata A. The metadata server may revoke any previously granted management permissions of other clients based on its record of such previously granted permissions (e.g., the read-permission management assigned to the first computational NIC and the read-permission management assigned to the second computational NIC, or a previously granted write permission). The metadata server may assign, to the second computational NIC, write-permission management (e.g., a write-permission token granting permission to locally assign the read and/or write permission) for metadata A. The second computational NIC may then assign write permission for metadata A to process 2-2 (e.g., the second electronic device 802 performing process 2-2).
In a case in which the write permission for the same metadata (e.g., metadata A) is requested when the second electronic device 802 is executing process 2-1 (while process 2-2 is executing), the second electronic device 802 may request the second computational NIC for write permission for metadata A required for process 2-1. In response, the second computational NIC may revoke the write permission previously assigned to process 2-2. The second computational NIC may then assign write permission for metadata A to process 2-1 (e.g., the second electronic device 802 executing process 2-1). As described above, since the work permission (e.g., the write permission) for metadata A is moved only internally/locally, communication with the metadata server may not be required, and locking overhead may thus be reduced.
Generally, when a read or write permission is revoked from a client/device by a metadata server, in response, that client/device's computational NIC will also revoke any corresponding read or write permission that has been granted to its local processes.
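Putting the client side of this protocol together, a computational NIC's permission handling may be sketched as follows. The sketch assumes the server interface from the previous sketch; every other name is an illustrative assumption. It assigns work permissions locally when a matching management permission is already held and contacts the metadata server only on a miss, mirroring the process 1-1/1-2 and 2-1/2-2 walkthroughs above.

```python
# Hedged sketch of a computational NIC's local permission management.
# 'server' is assumed to expose grant_read/grant_write as in the previous
# sketch; every other name here is an illustrative assumption.

class NicPermissionManager:
    def __init__(self, client_id, server):
        self.client_id = client_id
        self.server = server
        self.mgmt = {}      # metadata key -> "read" or "write" management permission
        self.writer = {}    # metadata key -> local process holding write permission

    def request_work_permission(self, key, process, op):
        if op == "read":
            if self.mgmt.get(key) not in ("read", "write"):
                self.server.grant_read(key, self.client_id)   # server round trip
                self.mgmt[key] = "read"
            return True     # read permission may be shared among local processes
        if op == "write":
            if self.mgmt.get(key) != "write":
                self.server.grant_write(key, self.client_id)  # server round trip
                self.mgmt[key] = "write"
            # Move the single local write permission between processes
            # without contacting the metadata server (cf. processes 2-1/2-2).
            self.writer[key] = process
            return True
        raise ValueError(f"unknown operation: {op}")

    def on_server_revoke(self, key):
        # When the server pulls back a token, the NIC also revokes any
        # corresponding permission granted to its local processes.
        self.mgmt.pop(key, None)
        self.writer.pop(key, None)
```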
In operation 910, a processor (e.g., the processor 110 of the electronic device 100) may request, from a computational NIC (e.g., the computational NIC 200), a metadata work permission for a piece of DFS metadata required by a first process executing on the processor.
In operation 920, when the computational NIC 200 determines that it does not currently have management permission corresponding to the requested work permission for the metadata (e.g., the metadata required for the first process), the computational NIC 200 may request a metadata server (e.g., the metadata server 20) for the management permission for the metadata.
In operation 930, the computational NIC 200 may obtain the management permission (e.g., a read-permission token or a write-permission token) from the metadata server 20, and assign the work permission (e.g., the read and/or write permission for the metadata) to the first process (e.g., the processor 110 executing the first process).
The metadata server 20 may collectively manage work permission for all metadata in a global area (or a global namespace) of a DFS. The computational NIC 200 may be entrusted, by the metadata server 20, with management permission for some of the metadata in the local DFS metadata cache of the computational NIC 200. As the metadata server 20 and the computational NIC 200 manage the work permission for metadata by dividing responsibility therebetween as described above, coherency (or integrity) of data (e.g., metadata) may be efficiently maintained. Hereinafter, operations internally performed in the computational NIC 200 will be described.
In operation 1010, a NIC processor (e.g., the NIC processor 210 of the computational NIC 200) may receive a request to read or write metadata.
In operation 1020, depending on the composition of its local DFS metadata cache, the NIC processor 210 may sequentially check whether the metadata is stored in a first cache-type portion of its local DFS metadata cache (e.g., the cache 240), a second cache-type portion (e.g., the cache 250 of the NIC 230), the NIC memory 220, or the memory expander 300.
The NIC 230 may be a CXL-NIC that supports CXL.cache as defined in a CXL protocol. The memory expander 300 may be a CXL memory expander that supports CXL.mem as defined in the CXL protocol. The computational NIC 200 may be a device of type 1, type 2, or type 3 as defined in the CXL protocol. Moreover, metadata functionality of a computational CXL NIC may be implemented as a core subsystem executing thereon (e.g., an agent, a controller, a driver, etc.). In the case of a computational CXL NIC with .cache functionality, the local cache-type portion(s) of the local DFS metadata cache may be accessed using the .cache functionality. In the case of a computational CXL NIC with .mem functionality, a portion of the local DFS metadata cache in a CXL memory expander may be accessed through the .mem functionality.
Referring to
In operation 1115, the NIC processor 210 may check whether the requested metadata is stored (e.g., cached) in the cache 240 of the NIC processor 210. In operation 1120, when the requested metadata is stored (e.g., cached) in the cache 240 of the NIC processor 210, the NIC processor 210 may verify whether the stored data (e.g., the cached data) is valid. The NIC processor 210 may verify a validity period (or flag) of the stored metadata. When the requested metadata is not stored in the cache 240 of the NIC processor 210 or the stored metadata is invalid, the NIC processor 210 may check a next storage space. When the requested metadata stored (e.g., cached) in the cache 240 of the NIC processor 210 is valid, the NIC processor 210 may provide a permission-granting response to the electronic device 100 (e.g., a client device) through a CXL interface.
In operation 1125, the NIC processor 210 may check whether the requested metadata is stored (e.g., cached) in the cache 250 of the NIC 230. In operation 1130, when the requested metadata is stored (e.g., cached) in the cache 250 of the NIC 230, the NIC processor 210 may verify whether the stored metadata (e.g., the cached metadata) is valid. When the requested metadata is not stored in the cache 250 of the NIC 230 or the stored metadata is invalid, the NIC processor 210 may check a next storage space.
In operation 1135, the NIC processor 210 may check whether the requested metadata is stored (e.g., cached) in the NIC memory 220. In operation 1140, when the requested metadata is stored (e.g., cached) in the NIC memory 220, the NIC processor 210 may verify whether the stored data (e.g., the cached data) is valid. When the requested metadata is not stored in the NIC memory 220 or the stored metadata is invalid, the NIC processor 210 may check a next storage space.
In operation 1145, the NIC processor 210 may check whether the requested metadata is stored (e.g., cached) in the memory expander 300. In operation 1150, when the requested metadata is stored (e.g., cached) in the memory expander 300, the NIC processor 210 may verify whether the stored data (e.g., the cached data) is valid.
In operation 1155, when the requested metadata is not stored in the memory expander 300 or the stored metadata is invalid, the NIC processor 210 may request a metadata server for the appropriate management permission for the requested metadata.
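For illustration, operations 1115 through 1155 amount to the following loop, a sketch under assumed helper names (valid() stands for the validity-period and permission checks described above): each tier is consulted in order, an entry that is present but stale is treated as a miss, and a miss in every tier falls through to a metadata-server request.

```python
# Hedged sketch of operations 1115-1155: walk the local tiers in order,
# verifying validity at each hit, and fall back to the metadata server on
# a complete miss. Tier and helper names are illustrative assumptions.

def serve_metadata_request(key, tiers, metadata_server, client_id):
    # tiers: e.g. [cache_240, cache_250, nic_memory_220, expander_300]
    for tier in tiers:
        entry = tier.get(key)
        if entry is not None and entry.valid():
            # valid() stands for checking the entry's validity period and
            # whether the NIC still holds a matching management permission.
            return entry.metadata
        # Not present, or present but stale/invalid: check the next tier.
    # Operation 1155: no valid local copy, so request the management
    # permission (and the metadata) from the metadata server.
    return metadata_server.request_management_permission(key, client_id)
```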
The locally cached metadata may be managed by a low-power CPU (e.g., the NIC processor 210) included in the computational NIC 200. To reduce latency (e.g., latency that may be caused by access to the metadata server) at only a small cost, the NIC processor 210 may be configured to perform only metadata-management operations.
An electronic device (e.g., the electronic device 100 of
Referring to
Referring to
In the graph 1300, the x-axis indicates the number of processes performed in the system, the left y-axis indicates kilo input/output operations per second (KIOPS), a measure of performance (e.g., operation processing speed) of a distributed system, and a broken-line graph (Lustre, Tiered MDS) indicates a KIOPS score for each system. In the graph 1300, the right y-axis indicates a degree of performance improvement, and a bar graph (Gain) indicates the degree of performance improvement of the system (Tiered MDS) including the electronic device 100 compared to the typical system (Lustre).
When a small number of processes are performed, the system (Tiered MDS) including the electronic device 100 may have higher performance than the typical system (Lustre). When 1 to 40 processes are performed, the performance improvement of the system (Tiered MDS) compared to the typical system (Lustre) may be approximately 242% on average.
The computing apparatuses, the servers, the clients, the electronic devices, the processors, the memories, the NICs, the storage devices, and other apparatuses, devices, units, modules, and components described herein with respect to
The methods illustrated in
Instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above may be written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the one or more processors or computers to operate as a machine or special-purpose computer to perform the operations that are performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the one or more processors or computers, such as machine code produced by a compiler. In another example, the instructions or software includes higher-level code that is executed by the one or more processors or computer using an interpreter. The instructions or software may be written using any programming language based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions herein, which disclose algorithms for performing the operations that are performed by the hardware components and the methods as described above.
The instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, may be recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access programmable read only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMs, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, Blu-ray or optical disk storage, hard disk drive (HDD), solid state drive (SSD), flash memory, a card type memory such as multimedia card micro or a card (for example, secure digital (SD) or extreme digital (XD)), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device that is configured to store the instructions or software and any associated data, data files, and data structures in a non-transitory manner and provide the instructions or software and any associated data, data files, and data structures to one or more processors or computers so that the one or more processors or computers can execute the instructions. In one example, the instructions or software and any associated data, data files, and data structures are distributed over network-coupled computer systems so that the instructions and software and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by the one or more processors or computers.
While this disclosure includes specific examples, it will be apparent after an understanding of the disclosure of this application that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.
Therefore, in addition to the above disclosure, the scope of the disclosure may also be defined by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the disclosure.