The present application is related to United States Patent Application entitled “Cluster File System with a Burst Buffer Appliance for Controlling Movement of Data Among Storage Tiers,” and United States Patent Application entitled “Cluster File System with a Burst Buffer Appliance for Coordinated Control of Data Movement Among Storage Tiers Based on User Specification,” each filed contemporaneously herewith and incorporated by reference herein.
The field relates generally to data storage, and more particularly to parallel file systems and other types of cluster file systems.
A cluster file system allows multiple client devices to share access to files over a network. One well-known cluster file system is the Lustre file system. Lustre is a Linux-based high performance cluster file system utilized for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site clusters. Lustre can readily scale to support tens of thousands of clients, petabytes of storage capacity, and hundreds of gigabytes per second of aggregate input-output (IO) throughput. Due to its high performance and scalability, Lustre is utilized in many supercomputers, as well as other complex computing environments, including large enterprise data centers.
In conventional Lustre implementations, it can be difficult to balance the conflicting requirements of storage capacity and IO throughput. IO operations on object storage servers are generally performed directly with back-end storage arrays associated with those servers, and the corresponding storage devices may not be well matched to the current needs of the system. This can lead to situations in which either performance is less than optimal or the costs of implementing the system become excessive.
Accordingly, despite the many advantages of Lustre file systems and other similar cluster file systems, a need remains for additional improvements, particularly with regard to IO operations. For example, further acceleration of IO operations, leading to enhanced system performance relative to conventional arrangements, would be desirable. Additionally or alternatively, an ability to achieve particular levels of performance at lower cost would be advantageous.
Illustrative embodiments of the present invention provide cluster file systems that implement coordinated storage tiering control functionality across a plurality of object storage servers using a burst buffer appliance, so as to provide significant improvements relative to conventional arrangements. For example, such arrangements allow for transparent inclusion of a flash storage tier in a cluster file system in a manner that avoids the need for any significant changes to clients, object storage servers, metadata servers or applications running on those devices.
In one embodiment, a cluster file system comprises a burst buffer appliance coupled to a plurality of object storage servers via a network. The burst buffer appliance is configured to implement storage tiering control functionality for at least first and second storage tiers comprising respective disjoint subsets of the plurality of object storage servers. The burst buffer appliance implements a coordinated movement of data between the first and second storage tiers to pre-fetch at least one additional portion of a single logical file that is stored across a plurality of said object storage devices from another of said plurality of object storage devices. A parallel log structured file system (PLFS) daemon may be employed to communicate with PLFS daemons on other object storage devices in the cluster file system to implement the coordinated movement of data. The PLFS daemon notifies one or more PLFS daemons on the other object storage devices to pre-fetch portions of a single logical file that are stored across a plurality of object storage devices.
According to a further aspect of the invention, the burst buffer appliance implements the coordinated movement of data between the first and second storage tiers such that substantially all portions of a single logical file that are stored across a plurality of said object storage devices in said cluster file system are stored in only one of said storage tiers at a given time.
The object storage servers in the first storage tier may be configured to interface with object storage targets of a first type and the object storage servers in the second storage tier may be configured to interface with object storage targets of a second type different than the first type. For example, the object storage targets of the first type may comprise non-volatile electronic storage devices such as flash storage devices, and the object storage targets of the second type may comprise disk storage devices.
As noted above, illustrative embodiments described herein provide significant improvements relative to conventional arrangements. In some of these embodiments, use of a flash storage tier in conjunction with a disk storage tier allows dynamic balancing of storage capacity and IO throughput requirements in a cluster file system, thereby allowing particular levels of performance to be achieved at a significantly lower cost than would otherwise be possible. Similar improvements are provided using other numbers and types of storage tiers, with migration between the tiers being controlled by one or more burst buffer appliances of the cluster file system.
Illustrative embodiments of the present invention will be described herein with reference to exemplary cluster file systems and associated clients, servers, storage arrays and other processing devices. It is to be appreciated, however, that the invention is not restricted to use with the particular illustrative cluster file system and device configurations shown. Accordingly, the term “cluster file system” as used herein is intended to be broadly construed, so as to encompass, for example, distributed file systems, parallel file systems, and other types of file systems implemented using one or more clusters of processing devices.
According to one aspect of the invention, discussed further below in conjunction with
The cluster file system 100 further comprises a metadata server 108 having an associated metadata target 110. The metadata server 108 is configured to communicate with clients 102 and object storage servers 104 over the network 106. For example, the metadata server 108 may receive metadata requests from the clients 102 over the network 106 and transmit responses to those requests back to the clients over the network 106. The metadata server 108 utilizes its metadata target 110 in processing metadata requests received from the clients 102 over the network 106. The metadata target 110 may comprise a storage array or other type of storage device.
Storage arrays utilized in the cluster file system 100 may comprise, for example, storage products such as VNX® and Symmetrix VMAX®, both commercially available from EMC Corporation of Hopkinton, Mass. A variety of other storage products may be utilized to implement at least a portion of the object storage targets and metadata target of the cluster file system 100.
The network 106 may comprise, for example, a global computer network such as the Internet, a wide area network (WAN), a local area network (LAN), a satellite network, a telephone or cable network, a cellular network, a wireless network such as WiFi or WiMAX, or various portions or combinations of these and other types of networks. The term “network” as used herein is therefore intended to be broadly construed, so as to encompass a wide variety of different network arrangements, including combinations of multiple networks possibly of different types.
The object storage servers 104 in the present embodiment are arranged into first and second storage tiers 112-1 and 112-2, also denoted as Storage Tier 1 and Storage Tier 2, although it is to be appreciated that more than two storage tiers may be used in other embodiments. As noted above, each of the storage devices 105 may be viewed as being representative of an object storage target of the corresponding one of the object storage servers 104. The first and second storage tiers 112-1 and 112-2 comprise respective disjoint subsets of the object storage servers 104. More particularly, the first storage tier 112-1 comprises object storage servers 104-1,1 through 104-1,L1 and the corresponding storage devices 105-1,1 through 105-1,L1, and the second storage tier 112-2 comprises object storage servers 104-2,1 through 104-2,L2 and the corresponding storage devices 105-2,1 through 105-2,L2.
The client 102 may also be referred to herein as simply a “user.” The term “user” should be understood to encompass, by way of example and without limitation, a user device, a person utilizing or otherwise associated with the device, a software client executing on a user device or a combination thereof. An operation described herein as being performed by a user may therefore, for example, be performed by a user device, a person utilizing or otherwise associated with the device, a software client or by a combination thereof.
The different storage tiers 112-1 and 112-2 in this embodiment comprise different types of storage devices 105 having different performance characteristics. As mentioned previously, each of the object storage servers 104 is configured to interface with a corresponding object storage target in the form of a storage device 105 which may comprise a storage array. The object storage servers 104-1,1 through 104-1,L1 in the first storage tier 112-1 are configured to interface with object storage targets of a first type and the object storage servers 104-2,1 through 104-2,L2 in the second storage tier 112-2 are configured to interface with object storage targets of a second type different than the first type. More particularly, in the present embodiment, the object storage targets of the first type comprise respective flash storage devices 105-1,1 through 105-1,L1, and the object storage targets of the second type comprise respective disk storage devices 105-2,1 through 105-2,L2.
The flash storage devices of the first storage tier 112-1 are generally significantly faster in terms of read and write access times than the disk storage devices of the second storage tier 112-2. The flash storage devices are therefore considered “fast” devices in this embodiment relative to the “slow” disk storage devices. Accordingly, the cluster file system 100 may be characterized in the present embodiment as having a “fast” storage tier 112-1 and a “slow” storage tier 112-2, where “fast” and “slow” in this context are relative terms and not intended to denote any particular absolute performance level. These storage tiers comprise respective disjoint subsets of the object storage servers 104 and their associated object storage targets 105. However, numerous alternative tiering arrangements may be used, including three or more tiers each providing a different level of performance. The particular storage devices used in a given storage tier may be varied in other embodiments and multiple distinct storage device types may be used within a single storage tier.
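The disjointness requirement on the tiers can be illustrated with a minimal Python sketch. The function name `validate_disjoint_tiers` and the tier labels "fast" and "slow" are hypothetical and chosen only for illustration; the server labels merely mirror the reference numerals used above.

```python
from itertools import combinations


def validate_disjoint_tiers(tiers):
    """Return True if no object storage server appears in more than one tier.

    `tiers` maps a tier label to the set of server labels in that tier.
    Raises ValueError when two tiers share a server.
    """
    for (name_a, members_a), (name_b, members_b) in combinations(tiers.items(), 2):
        shared = members_a & members_b
        if shared:
            raise ValueError(
                f"tiers {name_a!r} and {name_b!r} share servers: {sorted(shared)}"
            )
    return True


# Hypothetical two-tier arrangement: flash-backed servers and disk-backed servers.
tiers = {
    "fast": {"104-1,1", "104-1,2"},
    "slow": {"104-2,1", "104-2,2", "104-2,3"},
}
print(validate_disjoint_tiers(tiers))
```

The pairwise check also covers arrangements with three or more tiers, since every pair of tiers is compared.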
Also, although only a single object storage target is associated with each object storage server 104 in the
The flash storage devices 105-1,1 through 105-1,L1 may be implemented, by way of example, using respective flash Peripheral Component Interconnect Express (PCIe) cards or other types of memory cards installed in a computer or other processing device that implements the corresponding object storage server 104. Numerous alternative arrangements are possible. Also, a variety of other types of non-volatile or volatile memory in any combination may be used to implement at least a portion of the storage devices 105. Examples of alternatives to flash storage devices that may be used as respective object storage targets in other embodiments of the invention include non-volatile memories such as magnetic random access memory (MRAM) and phase change random access memory (PC-RAM).
The flash storage devices of the first storage tier 112-1 generally provide higher performance than the disk storage devices but the disk storage devices of the second storage tier 112-2 generally provide higher capacity at lower cost than the flash storage devices. The exemplary tiering arrangement of
The cluster file system 100 further comprises a burst buffer appliance 150 configured to communicate with clients 102, object storage servers 104 and metadata servers 108 over the network 106. The burst buffer appliance 150 in the present embodiment is assumed to comprise a flash memory or other high-speed memory having a substantially lower access time than the storage tiers 112. The burst buffer appliance 150 may optionally comprise an analytics engine, and may include other components.
Although flash memory will often be used for the high-speed memory of the burst buffer appliance 150, other types of low-latency memory could be used instead of flash memory. Typically, such low-latency memories comprise electronic memories, which may be implemented using non-volatile memories, volatile memories or combinations of non-volatile and volatile memories. Accordingly, the term “burst buffer appliance” as used herein is intended to be broadly construed, so as to encompass any network appliance or other arrangement of hardware and associated software or firmware that collectively provides a high-speed memory and optionally an analytics engine to control access to the high-speed memory. Thus, such an appliance includes a high-speed memory that may be viewed as serving as a buffer between a computer system comprising clients 102 executing on compute nodes (not shown) and a file system such as storage tiers 112, for storing bursts of data associated with different types of IO operations.
In the
More particularly, in this embodiment of
The burst buffer appliance 150 further comprises a processor 156 coupled to a memory 158. The processor 156 may comprise a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other type of processing circuitry, as well as portions or combinations of such circuitry elements. The memory 158 may comprise random access memory (RAM), read-only memory (ROM) or other types of memory, in any combination.
The memory 158 and other memories disclosed herein may be viewed as examples of what are more generally referred to as “computer program products” storing executable computer program code.
Also included in the burst buffer appliance 150 is network interface circuitry 154. The network interface circuitry 154 allows the burst buffer appliance 150 to communicate over the network 106 with the clients 102, object storage servers 104 and metadata servers 108. The network interface circuitry 154 may comprise, for example, one or more conventional transceivers.
The data placement and migration controller 152 of the burst buffer appliance 150 may be implemented at least in part in the form of software that is stored in memory 158 and executed by processor 156.
The burst buffer appliance 150 comprising processor, memory and network interface components as described above is an example of what is more generally referred to herein as a “processing device.” Each of the clients 102, object storage servers 104 and metadata servers 108 may similarly be implemented as a processing device comprising processor, memory and network interface components.
Although only a single burst buffer appliance 150 is shown in the
The cluster file system 100 may be implemented, by way of example, in the form of a Lustre file system, although use of Lustre is not a requirement of the present invention. Accordingly, servers 104 and 108 need not be configured with Lustre functionality, but may instead represent elements of another type of cluster file system. An example of a Lustre file system configured in accordance with an embodiment of the invention will now be described with reference to
As illustrated in
A given OSS 204 exposes multiple OSTs 205 in the present embodiment. Each of the OSTs may comprise one or more storage arrays or other types of storage devices. The total data storage capacity of the Lustre file system 200 is the sum of all the individual data storage capacities represented by the OSTs 205. The clients 202 can concurrently access this collective data storage capacity using data IO requests directed to the OSSs 204 based on metadata obtained from the MDS 208. The IO requests and other similar requests herein may be configured, for example, in accordance with standard portable operating system interface (POSIX) system calls.
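As a small worked example of the capacity statement above, the following Python sketch sums per-OST capacities over all OSSs. The server names and capacity figures are made up for illustration only.

```python
def total_capacity_tb(osts_by_oss):
    """Sum the capacities (in TB) of every OST exposed by every OSS."""
    return sum(capacity for osts in osts_by_oss.values() for capacity in osts)


# Hypothetical layout: each OSS exposes several OSTs of the given sizes in TB.
example_layout = {
    "OSS-A": [10, 10],
    "OSS-B": [20],
    "OSS-C": [5, 5, 5],
}
print(total_capacity_tb(example_layout))
```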
The MDS 208 utilizes the MDT 210 to provide metadata services for the Lustre file system 200. The MDT 210 stores file metadata, such as file names, directory structures, and access permissions.
Additional details regarding conventional aspects of Lustre file systems may be found in, for example, Cluster File Systems, Inc., “Lustre: A Scalable, High-Performance File System,” November 2002, pp. 1-13, and F. Wang et al., “Understanding Lustre Filesystem Internals,” Tech Report ORNL/TM-2009/117, April 2010, pp. 1-95, which are incorporated by reference herein.
As indicated previously, it is difficult in conventional Lustre implementations to balance the conflicting requirements of storage capacity and IO throughput. This can lead to situations in which either performance is less than optimal or the costs of implementing the system become excessive.
In the present embodiment, these and other drawbacks of conventional arrangements are addressed by configuring the burst buffer appliance 150 of the Lustre file system 200 to incorporate storage tiering control functionality. As will be described, such arrangements advantageously allow for transparent inclusion of a flash storage tier in a cluster file system in a manner that avoids the need for any significant changes to clients, object storage servers, metadata servers or applications running on those devices. Again, other types and configurations of multiple storage tiers and associated storage devices may be used. Also, multiple burst buffer appliances 150 may be implemented in the system in other embodiments.
The particular storage tiering arrangement implemented in Lustre file system 200 includes first and second storage tiers 212-1 and 212-2, with data migration software 230 being utilized to control movement of data between the tiers. Although shown as separate from the burst buffer appliance 150, the data migration software 230 is assumed to be implemented at least in part in a controller of the burst buffer appliance 150, which may be similar to the data placement and migration controller 152 utilized in the
In the first storage tier 212-1, there are L1 OSSs having K1, K2, . . . KL1 OSTs, respectively. Thus, for example, OSS 204-1,1 has OSTs denoted 205-1,1,1 through 205-1,1,K1, and OSS 204-1,L1 has OSTs denoted 205-1,L1,1 through 205-1,L1,KL1.
In the second storage tier 212-2, there are L2 OSSs having M1, M2, . . . ML2 OSTs, respectively. Thus, for example, OSS 204-2,1 has OSTs denoted 205-2,1,1 through 205-2,1,M1, OSS 204-2,2 has OSTs denoted 205-2,2,1 through 205-2,2,M2, and OSS 204-2,L2 has OSTs denoted 205-2,L2,1 through 205-2,L2,ML2.
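The three-part OST numbering described above (tier, OSS index within the tier, OST index within the OSS) can be enumerated mechanically. The following Python sketch is illustrative only; the function name `ost_labels` and the example OST counts are assumptions.

```python
def ost_labels(tier, osts_per_oss):
    """Return one list of OST labels per OSS in the given tier.

    `osts_per_oss` holds the OST count of each OSS in order, i.e. the
    values K1..KL1 for the first tier or M1..ML2 for the second tier.
    """
    return [
        [f"205-{tier},{oss},{ost}" for ost in range(1, count + 1)]
        for oss, count in enumerate(osts_per_oss, start=1)
    ]


# Hypothetical first tier with two OSSs exposing 2 and 3 OSTs respectively.
for labels in ost_labels(1, [2, 3]):
    print(labels)
```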
As in the
It should be noted with regard to the illustrative embodiments of
Examples of operations that may be performed in the system 100 or 200 utilizing the burst buffer appliance 150 will now be described in more detail with reference to the flow diagram of
In these examples, as in other embodiments described herein, the flash storage tier is also referred to as a “fast” storage tier and the disk storage tier is also referred to as a “slow” storage tier. Again, the terms “fast” and “slow” in this context are relative terms and should not be construed as requiring any particular absolute performance levels.
Referring now more particularly to the flow diagram of
The client CN then sends an “open, create” request to the MDS which responds with metadata that is assumed to comprise a layout of at least a portion of the disk storage tier comprising OSSs 3, 4 and 5. The write operation is then performed by the CN interacting with one or more of the OSSs of the disk storage tier using the layout metadata provided by the MDS. Upon completion of the write operation, an acknowledgement message denoted “ack, done” is provided by the appropriate OSS of the disk storage tier back to the CN. A “close” request is then sent by the CN to the MDS as indicated.
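The message sequence of this write operation can be sketched as a toy, in-memory Python model. The class and method names are hypothetical and are not Lustre interfaces, and the round-robin distribution of chunks across the layout is an assumption made only to keep the example concrete.

```python
class MetadataServer:
    """Toy MDS that hands out a layout of disk-tier OSSs on open/create."""

    def __init__(self, disk_tier_oss):
        self.disk_tier_oss = list(disk_tier_oss)
        self.open_files = set()

    def open_create(self, path):
        self.open_files.add(path)
        return list(self.disk_tier_oss)  # layout metadata

    def close(self, path):
        self.open_files.discard(path)


class ObjectStorageServer:
    """Toy OSS that stores chunks in memory and acknowledges each write."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def write(self, path, index, data):
        self.objects[(path, index)] = data
        return "ack, done"


def client_write(mds, servers, path, chunks):
    """Open/create, write each chunk using the layout, then close."""
    layout = mds.open_create(path)
    acks = []
    for index, chunk in enumerate(chunks):
        target = servers[layout[index % len(layout)]]  # assumed round-robin
        acks.append(target.write(path, index, chunk))
    mds.close(path)
    return acks


servers = {n: ObjectStorageServer(n) for n in (3, 4, 5)}
mds = MetadataServer(disk_tier_oss=[3, 4, 5])
print(client_write(mds, servers, "/demo/file", [b"a", b"b", b"c", b"d"]))
```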
As shown in
In one exemplary implementation, the OSD-PLFS of
The exemplary PLFS daemon 700 runs on each OSS node 400 and communicates with the OSD. For example, a client 102 on a compute node may request data from an OSS. The OSS notifies the PLFS daemon 700 on the OSS 400 of the data request. The PLFS daemon 700 on the originating OSS 400 knows that the exemplary requested data is part of a logical file that is striped across a plurality of OSSs 400. The originating PLFS daemon 700 can then notify PLFS daemons 700 on other OSSs 400 storing portions of the requested logical file of the request and indicate that the other OSSs 400 should pre-fetch their data portions. The exemplary PLFS daemon 700 can also optionally communicate with off-node burst buffer-aware entities.
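The peer notification described above can be sketched as follows. This is a simplified in-process Python model, not the PLFS daemon's actual interface: the class `PlfsDaemon`, its method names, and the shared stripe map are all assumptions made for illustration, and real daemons would exchange messages over the network.

```python
class PlfsDaemon:
    """Toy per-OSS daemon that forwards pre-fetch hints to peer daemons."""

    def __init__(self, oss_name, stripe_map):
        self.oss_name = oss_name
        # Hypothetical shared map: logical file -> OSS names holding portions.
        self.stripe_map = stripe_map
        self.peers = {}
        self.prefetched = []

    def connect(self, daemons):
        """Register every other daemon as a peer."""
        self.peers = {d.oss_name: d for d in daemons if d is not self}

    def on_data_request(self, logical_file):
        """Called when the local OSS reports a client request for this file."""
        notified = []
        for holder in self.stripe_map.get(logical_file, []):
            if holder != self.oss_name and holder in self.peers:
                self.peers[holder].on_prefetch_hint(logical_file)
                notified.append(holder)
        return notified

    def on_prefetch_hint(self, logical_file):
        """Record that the local portion of the file should be pre-fetched."""
        self.prefetched.append(logical_file)


stripe_map = {"file900": ["oss1", "oss2"]}
daemons = [PlfsDaemon(name, stripe_map) for name in ("oss1", "oss2", "oss3")]
for daemon in daemons:
    daemon.connect(daemons)
print(daemons[0].on_data_request("file900"))
```

Only the daemons on OSSs that hold a portion of the requested file receive a hint; in this example the daemon on the hypothetical "oss3" is left untouched.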
As shown in
Layer 540 is a modified layer of the stack 500, corresponding to the OSD-PLFS of
Layer 560 is also a modified layer of the stack 500, corresponding to the modified burst buffer implementation of PLFS which comprises the data migration functionality of burst buffer PLFS in accordance with the present invention, as well as conventional PLFS burst buffer functionality. As discussed above, the burst buffer appliance 150 communicates with flash storage 570 (such as flash storage 105-1 of
For a more detailed discussion of stacks for Lustre clustered file systems, see, for example, A. Dilger et al., “Lustre on ZFS,” Lustre Admin and Developer Workshop (Sep. 24, 2012), incorporated by reference herein.
As shown in
The exemplary PLFS daemon 700 comprises functions for processing each possible received item. For example, upon a write operation, the exemplary PLFS daemon 700 will use an Evict command to request the data placement and migration controller 440 (
Similarly, for a read operation, the exemplary PLFS daemon 700 determines whether a file that is stored on a plurality of OSSs should be pre-fetched using a pre-stage command. Likewise, when another daemon 700 suggests pre-staging to the current PLFS daemon 700, the current PLFS daemon 700 employs a pre-stage operation.
As indicated above, one aspect of the invention provides coordinated storage tiering control functionality across a plurality of object storage servers using one or more burst buffer appliances in a cluster file system.
During step 820, the PLFS daemon 700 on the first (originating) OSD-PLFS 850-1 notifies its peer daemons 700 about this access. The peer daemons 700 notify their associated OSD-burst buffers 150-n, which in turn notify the associated data migrator 440 on each node to begin prefetching the requested data.
Thereafter, a job on a compute node 805 requests those objects from a second OSD-PLFS 850-2 and the requested objects are now returned more quickly since they have been prefetched into the appropriate ldiskfs-flash storage 420 on the node of OSD-PLFS 850-2.
A user starts reading the data. The Lustre client 102 on the compute node 805 contacts the metadata server 108, which tells the client 102 that the file 900 is striped across the OSSs 850. The client 102 starts reading the first stripe, strip1, from the first OSS 850-1, in a manner similar to a conventional Lustre file system.
In accordance with aspects of the present invention, the OSD-burst buffer 150 will start reading the data from strip1 and will also ask the data migrator 440 to copy the data from disk storage 105-2 to flash storage 105-1.
The OSD-burst buffer 150 will also send a log of activity to the PLFS daemon 700, which will notice that strip1 is being read and predict that the rest of the related stripes from file 900 on other OSSs will soon be needed as well. The PLFS daemon 700 on the originating node 850-1 will therefore send a message to its counterpart daemons 700 on peer OSD nodes 850, which will then start the data movement with their data migrators 440. If the prediction is correct, the compute nodes 805 will soon start sending requests for the rest of the stripes to the other OSS nodes 850. These reads will then be faster since the data will have been prefetched into a flash storage device 105-1.
According to a further aspect of the invention, the prefetching is coordinated among a plurality of OSD-PLFS nodes 850. Consider two files in the cluster file system 100, with half of each file being stored on a flash storage device 105-1 and the remainder of each file being stored on a disk storage device 105-2. Aspects of the present invention recognize that when a file is read in parallel from a parallel file system, the latency to read the file is typically the latency of the slowest reader (i.e., if every device reads relatively fast except for one slower reader device, there is no benefit gained since every device must wait on the slowest reader). Thus, when there are two files each only partially on flash, the flash storage provides no real benefit.
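The slowest-reader observation can be made concrete with a short Python example. The latency figures below are made-up values chosen only to show the effect, and the model deliberately ignores contention and network effects.

```python
def parallel_read_latency(per_reader_latencies):
    """A parallel read completes only when its slowest reader completes."""
    return max(per_reader_latencies)


# Hypothetical per-stripe read latencies in milliseconds.
FLASH_MS = 1.0
DISK_MS = 10.0

half_on_flash = [FLASH_MS, FLASH_MS, DISK_MS, DISK_MS]
fully_on_flash = [FLASH_MS] * 4
fully_on_disk = [DISK_MS] * 4

for label, layout in [
    ("half on flash", half_on_flash),
    ("fully on flash", fully_on_flash),
    ("fully on disk", fully_on_disk),
]:
    print(label, parallel_read_latency(layout))
```

Under this simple model, a file that is half on flash reads no faster than a file that is entirely on disk, so flash capacity spent on a partially staged file yields no read-latency benefit.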
Existing cluster file systems make independent data placement decisions within each OSS. This makes it more likely that many files are partially on flash as opposed to a few files being fully on flash. The horizontal, coordinated communication across all OSSs provided by the present invention instead migrates files in their entirety.
According to one aspect of the present invention, coordinated decisions are employed regarding the sub-files of a given file so that all or no sub-files of a given file are stored on a flash storage device 105-1. This ensures fast read performance for those files on a flash storage device 105-1 and slow read performance for those files on a disk storage device 105-2. Without the horizontal coordination provided by the present invention, it is much more likely that only some of the sub-files of a given file are stored on a flash storage device 105-1.
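One way to picture the all-or-none rule is the following Python sketch. The greedy, first-come policy and the notion of a fixed number of flash "slots" are assumptions introduced for illustration; the text above specifies only that the sub-files of a given file are placed together, not how candidate files are chosen.

```python
def place_whole_files(subfile_counts, flash_slots):
    """Assign each file entirely to flash or entirely to disk.

    `subfile_counts` maps a file name to its number of sub-files.
    `flash_slots` is the hypothetical number of sub-files flash can hold.
    A file goes to flash only if all of its sub-files fit.
    """
    placement = {}
    remaining = flash_slots
    for name, count in subfile_counts.items():
        if count <= remaining:
            placement[name] = "flash"
            remaining -= count
        else:
            placement[name] = "disk"
    return placement


# Two files of four sub-files each, with flash room for four sub-files in total.
print(place_whole_files({"fileA": 4, "fileB": 4}, flash_slots=4))
```

With uncoordinated per-OSS decisions, the same four flash slots could end up holding two sub-files of each file, which is the partially staged case that yields no read-latency benefit.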
It is to be appreciated that the particular operations and associated messaging illustrated in
It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. At least a subset of these elements may be collectively implemented on a common processing platform, or each such element may be implemented on a separate processing platform.
Also, numerous other arrangements of computers, servers, storage devices or other components are possible in the cluster file system 100. Such components can communicate with other elements of the cluster file system 100 over any type of network or other communication media.
As indicated previously, components of a burst buffer appliance as disclosed herein can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device. A memory having such program code embodied therein is an example of what is more generally referred to herein as a “computer program product.”
The cluster file systems 100 and 200 or portions thereof may be implemented using one or more processing platforms each comprising a plurality of processing devices. Each such processing device may comprise processor, memory and network interface components of the type illustrated for burst buffer appliance 150 in
As indicated above, cluster file system functionality such as that described in conjunction with
It should again be emphasized that the above-described embodiments of the invention are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the disclosed techniques are applicable to a wide variety of other types and arrangements of cluster file systems and associated clients, servers and other processing devices that can benefit from burst buffer implemented storage tiering control functionality as described herein. Also, the particular configurations of system and device elements shown in