The present invention relates generally to the field of storage networks, and more specifically to a file-based hybrid storage scheme supporting multiple file switches in an aggregated file system.
An aggregated file system typically includes a large amount of data that are organized into different user files to serve multiple clients. From a client's perspective, one way to measure the performance of the aggregated file system is its file accessibility, i.e., how long it takes for the client to access a user file stored in the system. To improve file accessibility, a user file is often partitioned into multiple stripes that are allocated to different file servers such that file read or write operations can be spread across the multiple file servers and executed in a parallel fashion.
Meanwhile, it is also highly desirable for an aggregated file system to maintain a certain level of data redundancy so that an access request to a user file can still be satisfied even if one file server hosting at least a portion of the user file is temporarily taken offline. For example, the file system may choose to keep multiple identical copies of the user file or its stripes on different file servers through data mirroring. A downside of this scheme is that its disk storage efficiency per file is only 50%.
A more storage efficient approach often applied to block storage is called “Redundant Arrays of Independent Disks” level 5 (or the RAID-5) scheme. Given a user file including multiple stripes, each stripe comprising multiple data fragments, the RAID-5 scheme generates a parity fragment for each stripe through an exclusive-or operation of the data fragments and the data and parity fragments are arranged in such a manner that no two fragments are stored on the same disk or file server. Even though the RAID-5 scheme provides a higher disk storage efficiency (depending upon the number of data and parity fragments per stripe), the maintenance of a parity fragment per stripe seriously impedes certain file operations, e.g., file writes become quite expensive in a RAID-5 environment. Therefore, it is desired to have a new file storage scheme that has a per-file storage efficiency comparable to the RAID-5 scheme, but a per-file operational efficiency similar to the data mirroring scheme.
A hybrid file storage scheme is provided for managing user files in an aggregated file system. According to this hybrid file storage scheme, a user file comprises first and second sets of file segments, the first set being stored in a first array of file servers according to a first scheme and the second set being stored in a second array of file servers according to a second scheme. Upon receipt from a client of a file operating request with respect to a user file, the aggregated file system identifies the first set of file segments stored in the first array and the second set of file segments in the second array and then applies a corresponding operating instruction to the first and second sets of file segments, respectively.
In a first embodiment, a method of managing user files in an aggregated file system comprises receiving from a client a file operating request with respect to a user file, the request including a name of the user file and an operating instruction, identifying a first set of file segments of the user file stored in the aggregated file system according to a first scheme, identifying a second set of file segments of the user file stored in the aggregated file system according to a second scheme, and applying the operating instruction to the first and second sets of file segments, respectively.
In a second embodiment, an aggregated file system comprises a plurality of file servers and a file switch that includes a processor for executing instructions for storing, maintaining and providing access to a set of user files. These instructions include instructions for receiving from a client a file operating request with respect to a user file, the request including a name of the user file and an operating instruction; instructions for identifying a first set of file segments of the user file stored in the aggregated file system according to a first scheme; instructions for identifying a second set of file segments of the user file stored in the aggregated file system according to a second scheme; and instructions for applying the operating instruction to the first and second sets of file segments, respectively. For each user file, the plurality of file servers include a first array of file servers hosting the first set of file segments and a second array of file servers hosting the second set of file segments.
In a third embodiment, a file switch for use in an aggregated file system comprises at least one processing unit for executing computer programs, at least one interface for exchanging information with file servers, metadata server and client computers, a set of user files that have been updated by the file switch during a predefined time period, a request handle module for receiving a file operating request with respect to a user file, a file read module for extracting a plurality of file segments of a user file from the file servers and returning them to a requesting client, a file write module for updating a plurality of file segments of a user file in the file servers in accordance with a new version of the user file, and a file consolidate module for removing one or more user files from the set of updated user files in accordance with a predefined condition.
The aforementioned features and advantages of the invention as well as additional features and advantages thereof will be more clearly understood hereinafter as a result of a detailed description of embodiments of the invention when taken in conjunction with the drawings.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
User File. A “user file” is a file that a client computer works with (e.g., read, write, etc). A user file may be divided into portions and stored in multiple file servers of an aggregated file system.
Stripe. In the context of a file switch, a “stripe” is a portion of a user file. In some cases, an entire user file will be contained in a single stripe. But if the file being striped becomes larger than the stripe size, an additional stripe is created. In the RAID-5 scheme, each stripe may be further divided into N stripe fragments. Among them, N−1 stripe fragments store data of the user file and one stripe fragment stores parity information based on the data.
Metadata File. In the context of a file switch, a “metadata file” is a file that contains the metadata of a user file. The properties and state information defining the layout and/or other ancillary information of the user file is called metadata. While an ordinary client may not directly access the content of a metadata file by issuing read or write operations, it nonetheless has indirect access to certain metadata stored therein, such as file layout information, file length, etc.
File Switch. A “file switch” is a device performing various file operations in accordance with client instructions. The file switch is logically positioned between a client computer and a set of file servers. To the client computer, the file switch appears to be a file server having enormous storage capacities and high throughput. To the file servers, the file switch appears to be a client computer. The file switch directs the storage of individual user files over multiple file servers, using striping to improve throughput and using mirroring to improve fault tolerance as well as throughput.
The aggregated file system 150 includes a group of file servers 180, at least one metadata server 170 and a group of file switches 160 that have communication channels with the file servers 180 and the metadata server 170, respectively. The aggregated file system 150 typically manages a large number of user files, each one having a unique file name. There are many types of user files that are used for different purposes, including user files for storing data (e.g., database files, music files, MPEGs, videos, etc) and user files that contain applications and programs used by computer users. These user files range in size from a few bytes to multiple terabytes.
Depending upon their respective purposes, different types of user files may have different accessibility requirements and therefore may need different storage schemes. For example, a website's homepage often receives multiple file read requests simultaneously. To reduce the response delay, the aggregated file system may choose the data mirroring scheme for the homepage, with multiple copies residing on different file servers. Each request for the homepage is directed by file switches to one of the file servers, which may be selected so as to balance the system's workload and improve the system's overall performance. When a file is stored using the data mirroring scheme, if one hosting file server is temporarily taken offline, a file access request can be re-directed to and served by another hosting file server. However, as mentioned above, a disadvantage of the data mirroring scheme is that its disk storage efficiency is quite low. As a result, it may not be appropriate for storing a large-volume user file.
The accessibility of a large-volume user file may be limited by the throughput of a single file server, or by the number of file servers used for hosting the user file. To improve file accessibility, a user file may be divided into multiple stripes according to a data striping scheme, e.g., the RAID-5 scheme, in which the stripes are spread across multiple file servers with each one hosting only a portion of the user file. A single access request for the user file is translated by a file switch into multiple access requests, each directed to a different hosting file server, to increase the throughput. Data redundancy in the RAID-5 scheme is achieved by generating a parity fragment for a set of data fragments within a stripe and keeping the data and parity fragments on separate file servers.
It has been observed that the RAID-5 scheme works best when most file access requests are read requests (e.g., if the user file is a read-only video stream). However, the RAID-5 scheme is less efficient if many access requests are write requests that modify at least a portion of the user file (e.g., a database file), because every write operation on a stripe requires a subsequent update of its parity fragment, thereby significantly increasing the cost associated with the write operation. Note that if the parity fragment is not updated after each associated data write operation, the data redundancy of the user file may be temporarily lost until the parity fragment is updated. In this case, temporal windows may exist such that an unrecoverable error or system crash occurring within the windows may cause some user data to be lost. Below is a table comparing the steps necessary for updating a single data fragment within a stripe using non-RAID-5 and RAID-5 data storage schemes:
Therefore, the number of I/O operations needed in the RAID-5 scheme is 1 (step a)+1 (step b)+1 (step f)+1 (step g)=4 while the number needed in the non-RAID-5 scheme is only 2. In other words, a RAID-5 write is at least twice as expensive as a non-RAID-5 write.
In one embodiment of the present invention, a hybrid file storage scheme is proposed that combines the benefit inherent within the data mirroring scheme and the RAID-5 scheme. According to this hybrid file storage scheme, a user file comprises two sets of file segments. One set of file segments is stored in an array of file servers according to the mirroring scheme, each segment corresponding to multiple copies of a stripe fragment on different file servers, and the other set of file segments is stored in another array of file servers according to the RAID-5 scheme, each segment including at least two data fragments and one parity fragment arranged in a round-robin fashion. The user file also has an associated metadata file stored in a metadata server and the metadata file includes data structures identifying the two arrays of hosting file servers. Upon receipt of a file operating request with respect to the user file, a file switch of the aggregated file system invokes a module to access the user file's file segments stored in the two arrays of file servers and conducts certain operations on the stripe fragments stored in the two arrays of file servers accordingly.
In some embodiments, a file switch 220 of the aggregated file system is implemented using a computer system schematically shown in
The memory 209 may include high speed random access memory and may also include non volatile memory, such as one or more magnetic disk storage devices. The memory 209 may include mass storage that is remotely located from the CPU 200. The memory 209 stores the following elements, or a subset of such elements:
The file switch module 212, the state information 230 and the cached information 240 may include executable procedures, sub-modules, tables or other data structures. In other embodiments, additional or different modules and data structures may be used, and some of the modules and/or data structures listed above may not be used. More detailed descriptions of the file read module 213 and the file write module 214 are provided below in connection with
According to some embodiments, a metadata server 208 includes at least a plurality of metadata files, each metadata file associated with a user file.
Referring again to
Note that the aforementioned additional I/O operations required by the RAID-5 scheme on a block-based implementation may be reduced if the parity fragments are cached in a non-volatile random access memory (NVRAM). This approach reduces the number of write operations associated with the parity fragments without creating temporal windows in which the redundancy may be lost. The data stored the NVRAM is retained even during system crashes and it can be written back to disks in the subsequent recovery phase. Since NVRAM is a centralized resource and it is inherently up to date, a parity fragment found in the NVRAM should be accessed first and the copy in the disk should be fetched (and updated if necessary) only if not found in the NVRAM.
Unfortunately, there is a challenge for directly applying the same logic mentioned above to a file-based implementation involving multiple file switches. This is because the high scalability of a file switch based system depends on the fact that multiple file switches operate independently without synchronizing with one another. If the file switches have to synchronize with each other for each cached parity fragment, the scalability of the system is greatly compromised. In contrast, the present invention is directed to a scheme that avoids synchronization of cached parity fragments and handles file updates efficiently so as to minimize delays caused by inter-file switch communications.
In contrast, the file write module as depicted in
The file write module is initially similar to the file read module discussed above. For example, the file switch identifies a metadata file (620) and a stripe fragment distribution bitmap (630). If the content of the bitmap shows that all the data fragments of the user file are in RAID-S format, i.e., this is the first file write request associated with this particular user file, the file switch will skip tasks 640 and 650 and move directly to 670. Otherwise, the file switch selects a first array of file servers hosting the mirrored stripe fragments (640) and updates the content therein in accordance with the bitmap and the new version of the user file (650).
In one embodiment, for each mirrored data fragment found in the first array of file servers, the update operation 650 replaces the old content of the data fragment with the content in the new version if there is any change to the mirrored data fragment.
Note that each mirrored data fragment has a counterpart RAID-5 format data fragment when it is first generated in the first array of file servers, and the creation of the mirrored data fragment means that the content of its RAID-5 counterpart becomes stale. Therefore, any subsequent attempt to access the RAID-5 data fragment will be directed to the mirrored data fragment according to the user file's bitmap. But the stale RAID-5 data fragment in the second array of file servers remains intact until it is replaced by the mirrored data fragment in the first array of file servers. As a result, both the RAID-5 data fragment and its associated parity fragment become stale (however, they are still consistent with each other). More details about this replacement are provided below in connection with
Since data fragments affected by the current file write request may include not only some mirrored data fragments but also some RAID-5 data fragments, the file switch selects a second array of file servers hosting the remaining RAID-5 data fragments of the user file according to the bitmap (670). For each affected RAID-5 data fragment, the file switch generates in the first array of file servers at least two identical copies of the data fragment containing new content derived from the new version (680). As a result, the updated user file comprises two sets of data fragments, one set in the first array of file servers according to the data mirroring scheme and another set in the second array of file servers according to the RAID-5 scheme. Finally, the file switch completes the file write operation by updating the bitmap in the associated metadata file to reflect the current stripe fragment distribution (690).
In some embodiments, the new content of the user file may be provided by the client and therefore has no counterpart data fragment in either array of file servers. In this case, the file switch identifies sufficient free space in the first array of file servers, generates new mirrored data fragments hosting the new content therein, and then updates the metadata bitmap accordingly. In other words, the second array of file servers does not yet have any information referring to the new content.
As discussed above, unlike the conventional RAID-5 file write in which every data fragment update is followed by an expensive parity fragment update, the parity fragments in the second array of file servers are no longer synchronized with the mirrored data fragments in the first array of file servers when the user file exists in the aggregated file system according to the hybrid scheme. However, the parity fragments are still in synch with their respective RAID-5 data fragments in the second array of file servers and can still be used for reconstructing any missing RAID-5 data fragment other than the ones that will be replaced by the mirrored data fragments. Therefore, a user file in the hybrid scheme employs two strategies of improving a user file's availability: (1) if a RAID-5 data fragment is unavailable, the file switch can re-build the data fragment using its sibling data and parity fragments; and (2) if one file server hosting a mirrored data fragment is down, the file switch can visit another file server hosting one of the identical copies of the data fragment. Since the data redundancy occurs at the data fragment level, not at the file level, disk storage efficiency is not seriously compromised in the hybrid scheme.
It will be understood by one skilled in the art that, in an aggregated file system that often handles simultaneous file access requests for a single user file, the file read (or write) module discussed above cannot be executed appropriately unless certain data locking mechanisms have been implemented in the file system, some of which are internally managed by the file system, while others are explicitly invoked by the client. It is also worthy of noting that a file server in the present invention may manage one or more hard disks simultaneously.
Even though a file switch only duplicates data fragments that are affected by a file write request, not the whole user file, it is conceivable that the portion of a user file in the mirrored format will grow as the cumulative number of file write requests grows over time, with more and more disk space required in the first array of file servers for hosting the mirrored data fragments. Consequently, the hybrid file storage scheme slowly converges to a conventional data mirroring scheme and the benefit offered by the hybrid scheme diminishes slowly. For example, an existing use file, after being updated repeatedly, but without any extension, may occupy a storage space having the size of the user file in addition to the parity fragments and the mirrored fragments.
On the other hand, many user files have time-varying visit frequencies. For example, a database file including stock trading information may receive many more visits when the stock market is open than when the market is closed. In many case, the life cycle of a user file can be divided into at least two periods, an active period and an inactive period. During the active period, there is a higher demand for the availability of the user file and the benefit of the hybrid scheme usually outweighs its use of additional storage space. But during inactive periods, the benefits of the hybrid scheme may be outweighed by the costs, and the file system may address this imbalance by reorganizing the user file during the inactive period.
After selecting a user file in the working set according to a predefined selection criterion (720), the consolidator identifies its associated metadata file in the metadata server (730). Based upon the information embedded in the metadata file, e.g., the mirrored stripe fragment distribution bitmap, the consolidator identifies one copy for each mirrored stripe fragment in the first array of file servers and uses them to replace the obsolete RAID-5 data fragments stored in the second array of file servers (740). For each RAID-5 stripe which has at least one data fragment updated, the consolidator locks the user file or a stripe of the user file and recalculates its parity fragment using the new data fragments (750). After updating the user file according to the RAID-S scheme, the consolidator updates the metadata file (760), e.g., resetting the bitmap and other relevant data structures including the two location tables, releases the mirrored data fragments of the user file and eliminates the user file's entry from the working set. As a result, the disk space no longer occupied by the user file is now released for subsequent use. Next, the consolidator checks if a predefined stop condition is met (780), e.g., there is sufficient free disk space in the file system for storing mirrored stripe fragments, or the working set is empty. If the stop condition is met, the consolidating process is terminated. If not, the consolidator returns to task 720 to process next user file in the working set until the working set is emptied or the stop condition is met. In some embodiments, the consolidator monitors the access requests for a user file it is responsible for. If there is a client request for the user file, the consolidator may relinquish its access to the user file so as to allow the client request to go through. This strategy also makes sure that a full consolidation is carried out only when the user file is no longer being accessed by any client.
The bitmap 810 in
The foregoing description, for purposes of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.
This application is related to U.S. patent application Ser. No. 10/043,413, entitled File Switch and Switched File System, filed Jan. 10, 2002, and U.S. Provisional Patent Application No. 60/261,153, entitled FILE SWITCH AND SWITCHED FILE SYSTEM and filed Jan. 11, 2001, both of which are incorporated herein by reference.