The present application claims priority from Japanese application JP2003-369816 filed on Oct. 30, 2003, the content of which is hereby incorporated by reference into this application.
1. Field of the Invention
The present invention relates to a method of preventing disk fragmentation in a file system capable of reserving a disk storage area.
2. Description of the Related Art
In a conventional file system of UNIX (registered trademark) origin, a file is divided into metadata (inode), which is file management information, and user data, which is the actual contents of the file. The user data is managed in units of the file system block size (e.g., 4 KB). The metadata has a mapping table for managing the block positions where the user data is stored; the mapping table indicates the correspondence between a file offset and a file system block number. In the conventional file system, the mapping table stores an array of file system block numbers, and the mainstream approach is a block management algorithm in which block numbers are referenced more indirectly as the file offset becomes larger.
The block management algorithm will be described by using an example shown in
As disks, file systems and files have recently grown in capacity, the above-described block management algorithm is reaching its limits in both the file size it can handle and its performance. Instead of managing a one-to-one mapping between file offset and block for every block, as the block management algorithm does, the current general tendency is to use an extent method, which manages a start file offset, a start block number and a block length, as shown in
If a continuous area of a disk can be allocated, the extent method can express the mapping between user data and disk positions with a small number of entries and is very effective for large files. However, a continuous area cannot always be allocated, for example because the area has already been allocated to another file. The state in which the disk blocks allocated to one file are dispersed is called external fragmentation.
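The extent method can be illustrated with a minimal sketch. The names `Extent` and `lookup_block`, the use of block units for the start file offset, and the sample block numbers are assumptions made for illustration only; they are not taken from any particular file system.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional

BLOCK_SIZE = 4 * 1024  # file system block size used as the example in the text (4 KB)


class Extent(NamedTuple):
    """One mapping-table entry of the extent method (hypothetical layout)."""
    file_offset: int  # start file offset, counted in blocks
    start_block: int  # start disk block number
    length: int       # number of contiguous blocks


def lookup_block(extents: List[Extent], byte_offset: int) -> Optional[int]:
    """Return the disk block holding byte_offset, or None if it is unmapped.

    The extent list is assumed to be sorted by file_offset.
    """
    file_block = byte_offset // BLOCK_SIZE
    starts = [e.file_offset for e in extents]
    i = bisect_right(starts, file_block) - 1  # last extent starting at or before file_block
    if i < 0:
        return None
    e = extents[i]
    if file_block >= e.file_offset + e.length:
        return None  # offset falls past the end of this extent
    return e.start_block + (file_block - e.file_offset)


# A 1 MB file stored contiguously needs a single entry ...
contiguous = [Extent(0, 1000, 256)]
# ... whereas the first 192 KB of a fragmented file already need three.
fragmented = [Extent(0, 1000, 16), Extent(16, 5000, 16), Extent(32, 9000, 16)]
```

The sketch shows why the number of entries, and therefore the size of the mapping table, grows with the degree of fragmentation rather than with the file size alone.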
When fragmentation occurs in a file system using the extent method, not only is performance degraded but the mapping table also becomes bulky. As the mapping table grows, memory is likely to run short, which makes the OS unstable (deadlock, slowdown, panic).
In order to prevent fragments, the following measures are used, for example, in XFS.
Fragments in local accesses can be prevented fairly well by the above-described measures. For accesses via NFS, however, irrespective of the size of an I/O request at the NFS client, the request is divided during network packet assembly so that the I/O length at the server eventually becomes about 4 KB to 8 KB. For Write accesses via NFS, the procedure Open→Write (4 KB to 8 KB, both asynchronous and synchronous)→Fsync (write guarantee)→Close is repeated and a disk write occurs for every I/O, so that the effect of (1) cannot be expected.
For accesses via NFS, the reservation of (3) is released every 4 KB to 8 KB. This is a critical issue when reserving a continuous area. Therefore, the following measure is additionally used.
If (4) functions effectively, fragments are, at worst, about the reserved size (64 KB) of (2).
In XFS, 16 bytes are used for one extent entry. If a file of 1 TB is fragmented in units of 64 KB, the mapping table occupies 256 MB. A current high end NAS system has a storage capacity of over 100 TB and a main memory of several GB. Therefore, if fragmented files amounting to several TB are accessed at the same time, memory is likely to run short.
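The 256 MB figure follows directly from the numbers given in the text, as the short calculation below shows. The function name `mapping_table_bytes` is a hypothetical helper, and binary units (1 KB = 1024 bytes) are assumed.

```python
KB, MB, TB = 1024, 1024 ** 2, 1024 ** 4  # binary units are assumed here

EXTENT_ENTRY_BYTES = 16  # size of one extent entry, as stated in the text


def mapping_table_bytes(file_size: int, fragment_size: int) -> int:
    """Mapping table size when a file is split into fragments of fragment_size bytes."""
    extents = -(-file_size // fragment_size)  # ceiling division: one entry per fragment
    return extents * EXTENT_ENTRY_BYTES


# 1 TB / 64 KB = 16,777,216 extents; 16,777,216 x 16 bytes = 256 MB
print(mapping_table_bytes(1 * TB, 64 * KB) // MB, "MB")  # prints "256 MB"
```

For comparison, under the same assumptions the same 1 TB file held in 16 MB fragments would need 65,536 entries, or 1 MB of mapping table.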
VxFS of VERITAS Corporation adopts an algorithm that reserves an area twice as large as the current file size when an additional extent is acquired. Although this scheme prevents fragments fairly well, it has the drawback that too much area is reserved, so that a file system full condition is likely to occur.
In conventional file systems, preventing fragments comes at the cost of making a file system full condition more likely. If a large area is reserved, the unused portion must obviously be released, and the cost of this release process must also be taken into account.
Japanese Patent Application JP-A-8-115238 discloses a technique in which a plurality of storage areas of different sizes are reserved redundantly and, when actual data is to be stored, the storage area of a suitable size is selected. In this manner, data is prevented from being stored in an unnecessarily large reserved area, which prevents fragments (file fragmentation) to some extent. However, when the storage device has no spare area, reserving a plurality of areas becomes difficult and the intended effect cannot be obtained. There is another problem that the cost of the reserved area release process increases.
It is difficult for a conventional file system both to prevent fragments and to make a file system full condition unlikely. The present invention therefore addresses the issue of realizing a file system that achieves both. The invention also addresses the issue of reducing the release cost for an unnecessary area in a small scale file system.
The above-described issues can be solved by changing the area reservation policy and the area reservation size in accordance with the file size. Specifically, for a small file, reservation is performed at the actual I/O request length; for a file of middle size or larger, reservation is performed at a reservation size designated in advance in accordance with the file size. When an area of middle size or larger is to be reserved and the reservation fails because the file system has insufficient free capacity, reservation is retried at the actual I/O request length, which makes a file system full condition less likely. For a small file, because reservation is performed at the actual I/O request length, the reserved area release process is not performed, which improves the I/O response for small files.
According to the invention, the reservation size is changed with the file size. It is therefore possible to realize a file system capable of preventing disk fragments while, by handling the failure of a whole file reservation or a large reservation, making an insufficient file system capacity unlikely to occur.
For a small file, reservation is performed at the requested I/O size. It is therefore possible to skip the reservation release process for small files and to improve the response when generating and writing a small file.
Other objects, features and advantages of the invention will become apparent from the following description of the embodiments of the invention taken in conjunction with the accompanying drawings.
Embodiments of the invention will be described with reference to the accompanying drawings.
When a Write system call is issued, the control is passed to a Write processing unit 400. The Write processing unit 400 sends a reservation request to an area reservation release managing unit 420, by using a reservation size determined by an area reservation issuing unit 401.
If the reservation succeeds, a buffer generating unit 402 generates a buffer, and an I/O issuing unit 403 prepares to issue an I/O. If an asynchronous I/O is used, the request is passed to a queue from which the I/O can be issued, and the Write system call is terminated. If a synchronous I/O is used, the I/O is issued and its completion is awaited. After normal completion is confirmed, the Write system call is terminated.
Next, the reservation release process will be described. When a Close system call is issued, control is passed to a Close processing unit 410 in the kernel space. The Close processing unit 410 uses a reservation area release determining unit 411 to determine whether the release process is to be executed. If it is determined that the release process is to be executed, the area reservation release managing unit 420 is requested to release the unused portion of the reserved area. A resource releasing unit 412 then executes the release process for the file descriptor and the like. In this embodiment the reserved area is released as part of the Close system call, but it may instead be released as part of an Umount system call or when the in-memory inode is discarded.
Next, with reference to
At 102 it is judged whether the start offset of the file descriptor of the file to be written is equal to or larger than the sum of the current file size and a whole file judgement threshold value (e.g., 8 KB).
If the start offset is equal to or larger than the sum, a whole file reserved size (e.g., 16 KB) is adopted at 111. Besides this embodiment, which adopts the whole file reserved size, other embodiments are conceivable which immediately adopt the actual request size at 122 or the first stage reservation size at 114.
If the start offset is smaller than the sum, the process at 103 follows. At 103 it is judged whether the file size is equal to or larger than a third stage threshold value (e.g., 512 MB). If it is, a third stage reservation size (e.g., 16 MB) is adopted at 112.
If the file size is smaller than the third stage threshold value, the process at 104 follows. At 104 it is judged whether the file size is equal to or larger than a second stage threshold value (e.g., 32 MB). If it is, a second stage reservation size (e.g., 1 MB) is adopted at 113.
If the file size is smaller than the second stage threshold value, the process at 105 follows. At 105 it is judged whether the file size is equal to or larger than a first stage threshold value (e.g., 64 KB). If it is, a first stage reservation size (e.g., 64 KB) is adopted at 114.
In this embodiment, although the first stage threshold value, second stage threshold value and third stage threshold value are compared with the file size, another embodiment is conceivable which uses a file offset as the comparison object.
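The size selection at 102 to 105 can be sketched as follows. This is only an illustration under stated assumptions: the names `ReservationParams` and `select_reservation_size` are hypothetical, the default values are the examples given in the text, and each comparison is taken to be "equal to or larger than".

```python
from dataclasses import dataclass
from typing import Optional

KB, MB = 1024, 1024 * 1024


@dataclass
class ReservationParams:
    """Example values from the text; field names are hypothetical."""
    whole_file_threshold: int = 8 * KB   # whole file judgement threshold value
    whole_file_size: int = 16 * KB       # whole file reserved size
    stage1_threshold: int = 64 * KB
    stage1_size: int = 64 * KB
    stage2_threshold: int = 32 * MB
    stage2_size: int = 1 * MB
    stage3_threshold: int = 512 * MB
    stage3_size: int = 16 * MB


def select_reservation_size(start_offset: int, file_size: int,
                            p: ReservationParams) -> Optional[int]:
    """Return the adopted reservation size, or None if no condition holds.

    None means the caller reserves at the actual I/O size (step 122).
    """
    if start_offset >= file_size + p.whole_file_threshold:  # 102 -> 111
        return p.whole_file_size
    if file_size >= p.stage3_threshold:                     # 103 -> 112
        return p.stage3_size
    if file_size >= p.stage2_threshold:                     # 104 -> 113
        return p.stage2_size
    if file_size >= p.stage1_threshold:                     # 105 -> 114
        return p.stage1_size
    return None                                             # fall through to 122


params = ReservationParams()
print(select_reservation_size(0, 0, params))              # None: new small file
print(select_reservation_size(0, 600 * MB, params) // MB)  # 16
```

Checking the largest threshold first keeps each stage a single comparison, which mirrors the order 103, 104, 105 in the description.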
If none of the conditions at 102 to 105 is satisfied, the reservation request is issued at 122 to the area reservation release managing unit 420 using the actual I/O size. If any one of the reservation sizes at 111 to 114 is adopted, the reservation request is issued at 120 to the area reservation release managing unit 420 using the adopted reservation size. At 121 it is checked whether the area reservation failed because of insufficient file system capacity. If it did, reservation is performed again at 122 using the actual I/O request size. If the condition at 121 is not satisfied, namely if the reservation succeeded or failed for a reason other than insufficient file system capacity, the process at 123 follows. The process at 123 also follows after the process at 122 is executed.
At 123 it is checked whether the area reservation succeeded. If the reservation succeeded, the Write process continues at 132 and control is passed to the buffer generating unit 402. If the reservation failed, the Write process fails at 131 and an error is reported to the user program.
In this embodiment the file size judgement is executed in three stages, but the number of stages may be arbitrary. The first stage threshold value may be set to 0. In this case, the process never proceeds from 105 to 122.
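The reservation and fallback at 120 to 123 can be sketched as below. The `Result` enum, `reserve_for_write` and the `ToyFileSystem` allocator are hypothetical stand-ins for the area reservation release managing unit 420; the sketch models only free capacity, not block placement.

```python
from enum import Enum
from typing import Callable, Optional


class Result(Enum):
    OK = "ok"
    NO_SPACE = "no_space"        # insufficient file system capacity
    OTHER_ERROR = "other_error"  # any other failure


def reserve_for_write(reserve: Callable[[int], Result],
                      adopted_size: Optional[int], io_size: int) -> Result:
    """Issue the reservation, retrying at the actual I/O size on NO_SPACE.

    adopted_size is the size chosen at 111 to 114, or None if none was adopted.
    """
    if adopted_size is None:
        return reserve(io_size)          # 122: reserve at the actual I/O size
    result = reserve(adopted_size)       # 120: reserve at the adopted size
    if result is Result.NO_SPACE:        # 121: failed for lack of capacity?
        result = reserve(io_size)        # 122: retry at the actual I/O size
    return result                        # 123: OK -> 132, otherwise -> 131


class ToyFileSystem:
    """Toy allocator that tracks only the number of free bytes (assumption)."""

    def __init__(self, free_bytes: int) -> None:
        self.free = free_bytes

    def reserve(self, size: int) -> Result:
        if size > self.free:
            return Result.NO_SPACE
        self.free -= size
        return Result.OK


fs = ToyFileSystem(free_bytes=1024 * 1024)  # only 1 MB left
# A 16 MB reservation cannot fit, so the 8 KB request is reserved instead.
print(reserve_for_write(fs.reserve, 16 * 1024 * 1024, 8 * 1024))  # Result.OK
```

Only the capacity failure triggers the retry; any other error is returned unchanged, so the Write fails at 131 as in the description.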
Next, with reference to
If the condition at 501 is not satisfied, the process at 503 follows. At 503, in order to skip the reservation release process, control passes to the resource releasing unit 412 without involving the area reservation release managing unit 420, and the Close process is then terminated.
It is desirable that the first stage threshold value used in the Close process always coincide with the first stage threshold value at 105 shown in
The above-described first stage threshold value, second stage threshold value, third stage threshold value, first stage reservation size, second stage reservation size, third stage reservation size, whole file judgement threshold value and whole file reservation size are given default values in advance. It is desirable, however, that a user be able to set them again per system, per file system, per file and the like.
In the file system of this invention, in response to a setting request from the user space, the parameters in the table 601 can be replaced by using the interface 602 between the kernel and user. In response to a reference request from the user space, the current parameter values in the table 601 can be read by using the interface 602 between the kernel and user. As the interface 602 between the kernel and user, the /proc/sys file system of Linux, ioctl of UNIX (registered trademark) or the like is used.
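A parameter table with setting and reference requests can be sketched as below. The class `ParameterTable`, its key names and its validation rule are assumptions for illustration; the sketch does not model the /proc/sys or ioctl transport itself, only the replace and read operations on the table.

```python
KB, MB = 1024, 1024 * 1024


class ParameterTable:
    """Hypothetical stand-in for the table 601; defaults are the text's examples."""

    DEFAULTS = {
        "stage1_threshold": 64 * KB,
        "stage1_size": 64 * KB,
        "stage2_threshold": 32 * MB,
        "stage2_size": 1 * MB,
        "stage3_threshold": 512 * MB,
        "stage3_size": 16 * MB,
        "whole_file_threshold": 8 * KB,
        "whole_file_size": 16 * KB,
    }

    def __init__(self) -> None:
        self._values = dict(self.DEFAULTS)  # start from the default values

    def get(self, name: str) -> int:
        """Reference request: return the current value of one parameter."""
        return self._values[name]

    def set(self, name: str, value: int) -> None:
        """Setting request: replace one parameter after basic validation."""
        if name not in self._values:
            raise KeyError(name)
        if not isinstance(value, int) or value < 0:
            raise ValueError("parameter must be a non-negative integer")
        self._values[name] = value


table = ParameterTable()
table.set("stage1_threshold", 0)      # e.g. always use at least the first stage size
print(table.get("stage1_threshold"))  # 0
```

A table per system, per file system or per file could each be a separate instance of the same structure.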
According to the invention, a file system can be realized which can prevent excessive reservation operations, reduce the process cost of the area release and effectively prevent fragment generation. Accordingly, this file system can be applied widely to information processing apparatuses equipped with a disk storage.
It should be further understood by those skilled in the art that although the foregoing description has been made on embodiments of the invention, the invention is not limited thereto and various changes and modifications may be made without departing from the spirit of the invention and the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2003-369816 | Oct 2003 | JP | national |